THE METHOD OF PATH COEFFICIENTS
By 


SEWALL WRIGHT 
Department of Zoology, The University of Chicago. 


Introduction 


The method of path coefficients was suggested a number of 
years ago (Wright 1918, more fully 1920, 1921), as a flexible 
means of relating the correlation coefficients between variables 
in a multiple system to the functional relations among them. The 
method has been applied in quite a variety of cases. It seems
desirable now to make a restatement of the theory and to review 
the types of application, especially as there has been a certain 
amount of misunderstanding both of purpose and of procedure. 


Basic Formulae 
The object of investigation is a system of variable quantities, arranged in a typically branching sequential order representative of some chosen point of view toward the functional relations. Such a system is conveniently represented in a diagram such as Fig. 1. Those variables which are treated as dependent are connected with those of which they are considered functions by arrows. The system of factors back of each variable may be made formally complete by the introduction of symbols representative of total residual determination (as $V_u$ in Fig. 1). A residual correlation between variables is represented by a double-headed arrow.

[Fig. 1]

It will be assumed that all relations are linear.¹ Thus each variable is related to those from which unidirectional arrows are drawn to it by an equation of the following type, where $(V_0 - \bar{V}_0)$, $(V_1 - \bar{V}_1)$, etc. represent deviations from the means and $c_{01}$, $c_{02}$, etc. are the coefficients.

(1) $(V_0 - \bar{V}_0) = c_{01}(V_1 - \bar{V}_1) + c_{02}(V_2 - \bar{V}_2) + \cdots + c_{0m}(V_m - \bar{V}_m)$

It is convenient to measure the deviation of each variable by its standard deviation. Let $X_0 = \frac{V_0 - \bar{V}_0}{\sigma_0}$, $X_1 = \frac{V_1 - \bar{V}_1}{\sigma_1}$, etc., and let $p_{01} = c_{01}\frac{\sigma_1}{\sigma_0}$, etc.

(2) $X_0 = p_{01}X_1 + p_{02}X_2 + \cdots + p_{0m}X_m$

The coefficients in this form are of the type called path coefficients. Each obviously measures the fraction of the standard deviation of the dependent variable (with the appropriate sign) for which the designated factor is directly responsible, in the sense of the fraction which would be found if this factor varies to the same extent as in the observed data while all others (including residual factors $X_u$) are constant. This definition (except for determination of sign) can be written as follows, putting the constant factors after a dot.

(3) $p_{01} = \frac{\sigma_{0\cdot 23\cdots m,u}}{\sigma_0}\cdot\frac{\sigma_1}{\sigma_{1\cdot 23\cdots m,u}}$

It is sometimes convenient to represent the standard deviation due directly to a particular factor by a symbol. The form $\sigma_{0(1)} = p_{01}\sigma_0$ will be used. Obviously $\sigma_{0(1)} = c_{01}\sigma_1$ and, neglecting sign,

(4) $\sigma_{0(1)} = \frac{\sigma_{0\cdot 23\cdots m,u}}{\sigma_{1\cdot 23\cdots m,u}}\,\sigma_1$

The theorem which makes the path coefficient useful in relating correlations to functional relation is a very simple one. The correlation between $V_0$ and any other variable $V_q$ in such a system as Fig. 2 can be written in the form

(5) $r_{0q} = \frac{1}{n}\Sigma X_0X_q = \frac{1}{n}\Sigma X_q(p_{01}X_1 + p_{02}X_2 + \cdots + p_{0m}X_m) = p_{01}r_{1q} + p_{02}r_{2q} + \cdots + p_{0m}r_{mq} = \Sigma p_{0i}r_{iq}$

The correlation is thus analyzed into contributions from all of the paths in the diagram (Fig. 2) passing through each factor of one of the variables.

1 Relations which are far from linear with respect to the absolute values of the variables may be approximately linear with respect to variations, if the coefficients of variability are small. Thus if $V_0 = f(V_1, V_2, \ldots, V_m)$, the relations of small deviations from the mean values are approximately the first order terms of an expansion by Taylor's Theorem. The error may be represented by a residual term.

$\delta V_0 = \frac{\partial f}{\partial V_1}\delta V_1 + \frac{\partial f}{\partial V_2}\delta V_2 + \cdots + \frac{\partial f}{\partial V_m}\delta V_m + \text{residual}$

[Fig. 2]

But the correlation terms symbolized by $r_{iq}$ may be capable of analysis by application of this same formula. By repeated analysis of this sort, as far as the diagram (such as Fig. 1) permits, we are
led to the following principle: Any correlation between variables 
in a network of sequential relations can be analyzed into con- 
tributions from all of the paths (direct or through common fac- 
tors) by which the two variables are connected, such that the 
value of each contribution is the product of the coefficients per- 
taining to the elementary paths. If residual correlations are 
present (represented by bidirectional arrows) one (but never 
more than one) of the coefficients thus multiplied together to give 
the contribution of a connecting path, may be a correlation coeffi- 
cient. The others are all path coefficients. 

In tracing connecting paths it is obvious that one may trace 
back along the arrows and then forward as well as directly from 
one variable to the other (perhaps through intervening variables) 
but never forward and then back. That two factors affect the 
same dependent variable does not contribute to the correlation 
between them. Similarly two variables which are correlated with 
a third are not necessarily correlated with each other. As illus- 
trations of these principles consider the correlations between some 
of the variables in Fig. 1. 

, 4, =o A+ es 


36 46 13 Ss 56 36 


i! + BP. A.-=PrA-P +P PP +P 2: P 


23 56 Zt 26 36 12 4 ¥5 25 15 ‘25 is" % 2&6 
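The tracing rule can be checked mechanically by applying equation (5), $r_{0q} = \Sigma p_{0i}r_{iq}$, one dependent variable at a time. The following is a minimal sketch in Python, not part of the original paper: the function name `implied_correlations`, the variable labels and the numerical coefficients are all hypothetical, and residual factors are assumed to be uncorrelated with every variable listed earlier in the ordering.

```python
def implied_correlations(order, paths, exogenous_r=None):
    """Correlations implied by a path diagram, built up with equation (5).

    order       -- variable names, every factor listed before its dependents
    paths       -- {dependent: {factor: path coefficient}} (standardized)
    exogenous_r -- {(a, b): r} residual correlations between ultimate factors
    Assumption: residuals are uncorrelated with all variables listed earlier.
    """
    r = {(v, v): 1.0 for v in order}
    for (a, b), value in (exogenous_r or {}).items():
        r[(a, b)] = r[(b, a)] = value
    for k, v0 in enumerate(order):
        factors = paths.get(v0, {})
        for vq in order[:k]:
            if factors:
                # equation (5): r_0q = sum over factors i of p_0i * r_iq
                value = sum(p * r[(vi, vq)] for vi, p in factors.items())
                r[(v0, vq)] = r[(vq, v0)] = value
            else:
                # ultimate factors are uncorrelated unless stated otherwise
                r.setdefault((v0, vq), 0.0)
                r.setdefault((vq, v0), 0.0)
    return r


# Common cause: C -> X and C -> Y (back along one arrow, forward along the other).
common = implied_correlations(["C", "X", "Y"], {"X": {"C": 0.8}, "Y": {"C": 0.5}})

# Common effect: A -> Z <- B with A and B independent (never forward, then back).
collider = implied_correlations(["A", "B", "Z"], {"Z": {"A": 0.6, "B": 0.7}})

# Correlated factors joined by a double-headed arrow.
correlated = implied_correlations(
    ["A", "B", "Z"], {"Z": {"A": 0.5, "B": 0.4}}, {("A", "B"): 0.3}
)

print(common[("X", "Y")], collider[("A", "B")], correlated[("Z", "A")])
```

In the first case the correlation between X and Y is the product of the two elementary coefficients; in the second, A and B remain uncorrelated although both affect Z; in the third, the correlation of Z with A is the direct path plus the path through the residual correlation.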
It is sometimes convenient to use an extension of the symbol- 
ism in dealing with compound paths. 
4, i c . | t P wee + To egee 
Yor + Tig * Gas +f * ag - 


In this symbolism all of the variables along a contributing 
path are listed in proper order. If the path passes through a 
represented common factor, the latter is indicated by a dot. If it
involves an unanalyzed correlation the two ultimate correlated
variables may be indicated by a line as above. The evaluation of
such compound path coefficients is obvious, 


a * Ss Ses Sta * © S Se 
A S TGs 25? Frag = fr At, 6, 2 ete. 


It is to be noted that the symbolism does not apply to the 
indicated variables in an absolute sense but is always to be under- 
stood as relative to a particular arrangement of the variables, 
i.e. to a particular point of view with respect to the functional 
relations. 

A special case of equation (5) arises if one correlates a variable with itself, taking into account all factors (known and unknown)

(6) $r_{00} = \Sigma p_{0i}r_{0i} = 1$.

This may be put in a form which is usually more convenient by further analysis of $r_{0i} = p_{0i} + \Sigma p_{0j}r_{ij}$ (summed over $j \neq i$):

(7) $\Sigma p_{0i}^2 + 2\Sigma p_{0i}p_{0j}r_{ij} = 1$ (the second sum taken over pairs $i < j$).
Degree of Determination 
From the formula $p_{01}^2 = \frac{\sigma_{0(1)}^2}{\sigma_0^2}$ it is obvious that a squared path coefficient measures the portion of the variance of the dependent variable for which the independent variable is directly responsible, under the point of view adopted. The squared path coefficient may accordingly be called a coefficient of determination. Such coefficients were used before the term path coefficient was applied to the square root (Wright 1918).
The sum of the squared path coefficients is unity only in the case in which there are no correlations among the factors. It is necessary, therefore, to recognize additional terms measuring the changes in variance (positive or negative) due to correlated occurrence of the contributions of such factors ($2p_{0i}p_{0j}r_{ij}$, etc. in equation 7). It is tempting to apportion determination among the factors by using the terms $p_{0i}r_{0i}$ of equation (6) as measures of determination, and this has been done by some authors, e.g. Kirchevsky (1927), who independently reached a somewhat similar viewpoint on the interpretation of systems of correlated variables in other respects. No transparent meaning can be attached to such expressions (which may be negative). The term does not measure direct determination since it involves indirect connections between the variables. Neither does it measure total determination, direct and indirect. This is given by the squared correlation coefficient.
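The bookkeeping of equations (6) and (7) can be illustrated numerically. The sketch below is not from the original paper; the two path coefficients and the correlation between the factors are hypothetical values chosen only so that the total determination stays below unity. It separates the direct terms $\Sigma p_{0i}^2$ from the joint terms $2\Sigma p_{0i}p_{0j}r_{ij}$, checks that their sum equals $\Sigma p_{0i}r_{0i}$ over the known factors, and obtains the residual path coefficient from what is left.

```python
import math


def determination(p, r):
    """Split the known determination into direct and joint (correlated) parts."""
    m = len(p)
    direct = sum(pi ** 2 for pi in p)
    joint = sum(2 * p[i] * p[j] * r[i][j] for i in range(m) for j in range(i + 1, m))
    return direct, joint


# Hypothetical system: two known factors with correlation 0.3 between them.
p = [0.6, 0.5]
r = [[1.0, 0.3], [0.3, 1.0]]

direct, joint = determination(p, r)

# Residual path coefficient completing the determination, as in equation (7).
residual_path = math.sqrt(1.0 - (direct + joint))

# The same total reached through equation (6): sum of p_0i * r_0i,
# with r_0i = p_0i + sum_j p_0j * r_ij.
r0 = [sum(p[j] * r[i][j] for j in range(len(p))) for i in range(len(p))]
via_equation_6 = sum(pi * r0i for pi, r0i in zip(p, r0))

print(direct, joint, via_equation_6, residual_path)
```

The individual products $p_{0i}r_{0i}$ add up to the right total but, as the text argues, none of them separately measures either direct or total determination.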


The Correlation between Linear Functions 
The most direct application of the method is in the estimation 
of the correlation between two variables which are functions (in 
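part at least) of the same variables. Let $V_S$ and $V_T$ be two variables whose correlation is desired.

$V_S = c_{S1}V_1 + c_{S2}V_2 + \cdots + c_{Sm}V_m$
$V_T = c_{T1}V_1 + c_{T2}V_2 + \cdots + c_{Tm}V_m$

(8) $\sigma_S^2 = \Sigma c_{Si}^2\sigma_i^2 + 2\Sigma c_{Si}c_{Sj}\sigma_i\sigma_jr_{ij}$, $\sigma_T^2 = \Sigma c_{Ti}^2\sigma_i^2 + 2\Sigma c_{Ti}c_{Tj}\sigma_i\sigma_jr_{ij}$ (second sums over pairs $i < j$)

With $p_{Si} = c_{Si}\sigma_i/\sigma_S$ and $p_{Ti} = c_{Ti}\sigma_i/\sigma_T$,

(9) $r_{ST} = \Sigma p_{Si}p_{Ti} + \Sigma p_{Si}p_{Tj}r_{ij}$ (second sum over $i \neq j$)

[Fig. 3]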


As an example, suppose that we wish to find the correlation between the compound variables $V_S$ and $V_T$ where $V_S = V_1 + V_2 + V_3$ and $V_T = V_2 + V_3 + V_4$, knowing that $V_1$, $V_2$, $V_3$ and $V_4$ are all of equal variability ($\sigma_1 = \sigma_2 = \sigma_3 = \sigma_4$) and are independent ($r_{12} = r_{13} = r_{14} = r_{23} = r_{24} = r_{34} = 0$).

$\sigma_S^2 = \sigma_T^2 = 3\sigma_1^2$
$p_{S1} = p_{S2} = p_{S3} = p_{T2} = p_{T3} = p_{T4} = \sqrt{1/3}$
$r_{ST} = p_{S2}p_{T2} + p_{S3}p_{T3} = 2/3$.
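The correlation between two linear functions of the same variables can be verified directly from the coefficients, standard deviations and correlations of the components. The following sketch is an illustration only: the function name and argument layout are assumptions, and the example takes two sums of three components each, drawn from four independent components of equal variability, with two components in common.

```python
import math


def correlation_of_linear_functions(c_s, c_t, sigma, r):
    """Correlation between sum(c_s[i] * V_i) and sum(c_t[i] * V_i).

    sigma -- standard deviations of the components V_i
    r     -- full correlation matrix of the components
    """
    m = len(sigma)

    def cov(a, b):
        # Covariance of two linear functions of the same components.
        return sum(
            a[i] * b[j] * sigma[i] * sigma[j] * r[i][j]
            for i in range(m)
            for j in range(m)
        )

    return cov(c_s, c_t) / math.sqrt(cov(c_s, c_s) * cov(c_t, c_t))


# Four independent components of equal variability.
sigma = [1.0, 1.0, 1.0, 1.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Two sums of three components sharing two of them.
r_st = correlation_of_linear_functions([1, 1, 1, 0], [0, 1, 1, 1], sigma, identity)
print(r_st)
```

Each shared component contributes the product of two path coefficients of $\sqrt{1/3}$, so the two shared components together give a correlation of two thirds.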


Again suppose that we wish to estimate the true correlation 
between two variables from that between measurements known 
to be subject to considerable random error. Assume that the cor- 
relation between two measurements of the same variate has been 
found in each case. It is instructive to work this out from two 
different points of view. 

Let $\bar{A}$ be the mean of $m$ measurements ($A_1, A_2, \ldots, A_m$) of A. Let $\bar{B}$ be the mean of $n$ measurements ($B_1, B_2, \ldots, B_n$) of B.

[Fig. 4]

The known correlations are those between measures of A ($r_{AA}$), between measures of B ($r_{BB}$) and between measures of A and B ($r_{AB}$). Expressing the complete determination of $\bar{A}$ and $\bar{B}$ by their components, using equation (6):

$\Sigma p_{\bar{A}A_i}r_{\bar{A}A_i} = m\,p_{\bar{A}A}^2[1 + (m-1)r_{AA}] = 1$, $n\,p_{\bar{B}B}^2[1 + (n-1)r_{BB}] = 1$.

From these $p_{\bar{A}A} = \sqrt{\frac{1}{m[1 + (m-1)r_{AA}]}}$, $p_{\bar{B}B} = \sqrt{\frac{1}{n[1 + (n-1)r_{BB}]}}$.

The correlation between $\bar{A}$ and $\bar{B}$ can be written

(10) $r_{\bar{A}\bar{B}} = mn\,p_{\bar{A}A}\,p_{\bar{B}B}\,r_{AB} = r_{AB}\sqrt{\frac{m}{1 + (m-1)r_{AA}}}\sqrt{\frac{n}{1 + (n-1)r_{BB}}}$

For indefinitely large values of $m$ and $n$ the averages may be considered as true scores, $A_t$ and $B_t$.

(11) $r_{A_tB_t} = \frac{r_{AB}}{\sqrt{r_{AA}\,r_{BB}}}$


This result can be reached much more directly by the simpler set up (Fig. 5) in which the observed measurements are represented as functions of the true scores $A_t$, $B_t$ and of random errors. Note that the directions of the arrows are the reverse of those in Fig. 4.

[Fig. 5]

$r_{AA} = p_{AA_t}^2$, giving $p_{AA_t} = \sqrt{r_{AA}}$
$r_{BB} = p_{BB_t}^2$, giving $p_{BB_t} = \sqrt{r_{BB}}$
$r_{AB} = p_{AA_t}\,r_{A_tB_t}\,p_{BB_t}$, again giving $r_{A_tB_t} = \frac{r_{AB}}{\sqrt{r_{AA}\,r_{BB}}}$

This formula is, of course, Spearman's correction for attenuation. The purpose here is to bring out the simple way in which such formulae can be obtained by the method of path coefficients. The following is a more complex case in which a simple method is more essential.
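The relation between the two results can be seen numerically: the correlation between means of $m$ and $n$ measurements rises from the observed correlation (for $m = n = 1$) toward Spearman's corrected value as $m$ and $n$ grow. The sketch below is illustrative only; the function names and the three input correlations are hypothetical.

```python
import math


def correlation_of_means(r_ab, r_aa, r_bb, m, n):
    """Correlation between a mean of m measures of A and a mean of n measures of B."""
    return (
        r_ab
        * math.sqrt(m / (1 + (m - 1) * r_aa))
        * math.sqrt(n / (1 + (n - 1) * r_bb))
    )


def corrected_for_attenuation(r_ab, r_aa, r_bb):
    """Limiting value for indefinitely many measurements (Spearman's correction)."""
    return r_ab / math.sqrt(r_aa * r_bb)


# Hypothetical observed correlations.
r_ab, r_aa, r_bb = 0.42, 0.70, 0.60

single = correlation_of_means(r_ab, r_aa, r_bb, 1, 1)
many = correlation_of_means(r_ab, r_aa, r_bb, 10 ** 6, 10 ** 6)
limit = corrected_for_attenuation(r_ab, r_aa, r_bb)

for k in (1, 2, 5, 20, 100):
    print(k, round(correlation_of_means(r_ab, r_aa, r_bb, k, k), 4))
print("limit", round(limit, 4))
```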


The Statistical Effects of Inbreeding²
Assume for simplicity that the effects of different genetic factors combine additively (no dominance or epistasis). Fig. 6 represents the genetic constitutions of two parents and of their offspring ($O$). The constitution of the latter (under the above assumption and ignoring the possibility of sex linkage) is equally and completely determined by the constitutions of the two germ cells which united to produce it. It will be convenient to represent the path coefficients and correlations by single letters: $a$ for the path coefficient from germ cell to offspring, $b$ for that from parent to germ cell, $F$ for the correlation between the uniting germ cells and $M$ for that between the parents.

[Fig. 6]
The determination of $O$ by the two uniting germ cells can be expressed in the equation $2a^2(1+F) = 1$ (by equation 7), giving

(12) $a = \sqrt{\frac{1}{2(1+F)}}$

2 The purpose in presenting this and later examples is to illustrate something of the range of applicability of the method, rather than to give a detailed analysis of each case. For the latter the reader must be referred to the references cited at the end.


Two complementary germ cells (Fig. 7), such as could arise from the same reduction division, have the same relation to the genetic constitution of the parent (which they completely determine in a mathematical sense) as the two germ cells which united to produce the parent, assuming no selection and that different series of allelomorphs are combined at random (an assumption compatible with linkage among the genes and with inbreeding, but not with assortative mating). Using primes for the path coefficients and correlations of the preceding generations,

[Fig. 7]

(13) $b = a'(1+F') = \sqrt{\frac{1+F'}{2}}$.

Since $a' = \sqrt{\frac{1}{2(1+F')}}$ (by equation 12, applied to the preceding generation), $ba' = \frac{1}{2}$, irrespective of correlation between the parents under the assumed conditions.

(14) $F = b^2M$, $M = \frac{2F}{1+F'}$

The correlation between uniting gametes is directly related to the percentage of heterozygosis. Below is the correlation table between uniting gametes in a population in which genes $A$ and $a$ are present in the frequencies of $q$ and $1-q$ respectively and the proportion of heterozygotes is $p$.

            A              a                  Total
A           q - p/2        p/2                q
a           p/2            1 - q - p/2        1 - q
Total       q              1 - q              1

By the usual formula for correlation

(15) $F = \frac{(q - \frac{p}{2}) - q^2}{q(1-q)} = 1 - \frac{p}{2q(1-q)}$, $p = 2q(1-q)(1-F)$.

All of the path coefficients and correlations have now been expressed in terms of $F$'s. Various applications can be made. As a simple case consider the effects of continued brother-sister mating. Analyzing the correlation ($M$), Fig. 8, between the parents by tracing the connecting paths:

[Fig. 8]

(16) $M = 2a'^2b'^2(1+M')$.

Expressing all coefficients in terms of $F$'s and reducing,

(17) $F = \frac{1}{4}(1 + 2F' + F'')$,

(18) $p = \frac{1}{2}p' + \frac{1}{4}p''$.


Thus the percentage of heterozygosis (to which the effects 
of inbreeding are directly related) is very simply related to the 
percentages in the two preceding generations. If there were ini-
tially 50%, i.e. ½, heterozygosis, that of later generations would
be given by the terms of a series of fractions in which each 
numerator is the sum of the two preceding numerators (Fibo- 
nacci series) if the denominator is doubled in each generation. 
This rule was derived empirically by Jennings (1916) on work- 
ing out in detail the consequences of every possible mating, 
generation after generation. The analysis by path coefficients 
(Wright 1921b) not only demonstrates the generality of the em- 
pirical rule but can be applied as easily to more complicated cases 
in which the analysis by types of mating would be practically 
impossible. Consider, for example, the more general case of a 
population restricted to $N_m$ mature males and $N_f$ mature females (Wright 1931b). Under random mating, the chance of a mating of full brother and sister is $\frac{1}{N_mN_f}$, of half brother and sister $\frac{N_m+N_f-2}{N_mN_f}$ and of less closely related individuals $\frac{(N_m-1)(N_f-1)}{N_mN_f}$. The correlation between mating individuals is thus

(19) $M = a'^2b'^2\left[\frac{2(1+M')}{N_mN_f} + \frac{(N_m+N_f-2)(1+3M')}{N_mN_f} + \frac{4(N_m-1)(N_f-1)M'}{N_mN_f}\right]$

which yields on reduction

(20) $F = F' + \frac{N_m+N_f}{8N_mN_f}(1 - 2F' + F'')$

(21) $p = p' - \frac{N_m+N_f}{8N_mN_f}(2p' - p'')$

Equating $\frac{p}{p'}$ to $\frac{p'}{p''}$ gives $\left(\frac{1}{8N_m} + \frac{1}{8N_f}\right)$ as the approximate rate of reduction of heterozygosis per generation. The special case in which the population is equally divided between males and females ($N_m = N_f = \frac{N}{2}$) gives $\frac{1}{2N}$ as the rate of reduction, a figure recently verified by R. A. Fisher by a very different mode of analysis.
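The brother-sister recurrence, in which the heterozygosis of a generation is one half that of the preceding generation plus one quarter that of the generation before, is easy to iterate exactly. The sketch below is an illustration and not part of the original analysis: the function names are hypothetical, and the starting values assume heterozygosis of one half in both of the first two generations. The second function evaluates the approximate rate of loss $\frac{1}{8N_m} + \frac{1}{8N_f}$ for a population of limited size.

```python
from fractions import Fraction


def heterozygosis_sib_mating(generations, p0=Fraction(1, 2), p1=Fraction(1, 2)):
    """Heterozygosis under continued brother-sister mating: p = p'/2 + p''/4."""
    series = [p0, p1]
    while len(series) < generations:
        series.append(Fraction(1, 2) * series[-1] + Fraction(1, 4) * series[-2])
    return series[:generations]


def rate_of_loss(n_males, n_females):
    """Approximate fraction of heterozygosis lost per generation."""
    return Fraction(1, 8 * n_males) + Fraction(1, 8 * n_females)


series = heterozygosis_sib_mating(8)

# With the denominator doubled each generation the numerators follow the
# Fibonacci series: 1/2, 2/4, 3/8, 5/16, 8/32, ...
numerators = [int(p * 2 ** (k + 1)) for k, p in enumerate(series)]
print(numerators)

# Population of 100 divided equally between the sexes: rate is 1/(2N).
print(rate_of_loss(50, 50))
```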

The method has also been applied in the much more compli- 
cated case of assortative mating based on somatic resemblance 
(Wright 1921b). 

In the case of the irregular inbreeding encountered in live 
stock pedigrees (Wright 1922, 1923a), the basic formula of path 
coefficients leads immediately to the formula 


(22) $F = \Sigma\left[\left(\frac{1}{2}\right)^{n+n'+1}(1+F_A)\right]$

where $n$ and $n'$ are the number of generations from sire and dam respectively to the common ancestor ($A$) at the head of each connecting path. By appropriate sampling methods (Wright & McPhee, 1925) this formula can be used in the study of whole breeds. Closely allied is the formula for the genetic correlation between any two individuals ($X$, $Y$). Letting $n$ and $n'$ be the generations from $X$ and $Y$ respectively to the common ancestor of any connecting path

(23) $r_{XY} = \frac{\Sigma\left[\left(\frac{1}{2}\right)^{n+n'}(1+F_A)\right]}{\sqrt{(1+F_X)(1+F_Y)}}$

These formulae have been extensively applied in breed analysis (Wright 1923b-c, McPhee & Wright 1925, 1926, Smith 1926, Calder 1927, Lush 1932).
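Formulae (22) and (23) sum one term per connecting path, each term depending only on the number of generations on the two sides of the common ancestor and on that ancestor's own inbreeding. A minimal sketch follows, with hypothetical function names and with the connecting paths supplied by hand as `(n, n_prime, F_A)` triples; finding the paths in an actual pedigree is not attempted here.

```python
from fractions import Fraction
import math

HALF = Fraction(1, 2)


def inbreeding_coefficient(paths):
    """Formula (22): sum of (1/2)^(n + n' + 1) * (1 + F_A) over connecting paths."""
    return sum(
        (HALF ** (n + n_prime + 1) * (1 + Fraction(f_a)) for n, n_prime, f_a in paths),
        Fraction(0),
    )


def relationship(paths, f_x=0, f_y=0):
    """Formula (23): genetic correlation between individuals X and Y."""
    numerator = sum(
        (HALF ** (n + n_prime) * (1 + Fraction(f_a)) for n, n_prime, f_a in paths),
        Fraction(0),
    )
    return float(numerator) / math.sqrt((1 + f_x) * (1 + f_y))


# Offspring of a full brother-sister mating: two common ancestors (the
# grandparents), each one generation from the sire and one from the dam.
f_full_sib = inbreeding_coefficient([(1, 1, 0), (1, 1, 0)])

# Offspring of a half brother-sister mating: one common ancestor.
f_half_sib = inbreeding_coefficient([(1, 1, 0)])

# Correlation between two non-inbred full sibs.
r_full_sibs = relationship([(1, 1, 0), (1, 1, 0)])

print(f_full_sib, f_half_sib, r_full_sibs)
```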


Multiple Regression 
The preceding applications have consisted in the main in the deduction of correlation coefficients from knowledge of the functional relations. The method can be applied as well to the inverse problem, that of finding the best linear expression for one variable in terms of a number of others, from knowledge of the correlation coefficients. No assumptions are made with respect to causal relations. Analysis of the correlations between $V_0$ and the other variables (Fig. 9), by the basic formula, gives the following set of equations.

[Fig. 9]

(24)
$r_{01} = p_{01} + p_{02}r_{12} + \cdots + p_{0m}r_{1m}$
$r_{02} = p_{01}r_{12} + p_{02} + \cdots + p_{0m}r_{2m}$
$\cdots$
$r_{0m} = p_{01}r_{1m} + p_{02}r_{2m} + \cdots + p_{0m}$

Obviously these are merely the normal equations of the meth- 
od of least squares in a slightly disguised form, as might be 
expected from the derivation of the basic formula. The solution 
for the path coefficients, expressed in terms of determinants, 
merely needs to be multiplied by the proper ratio of standard
deviations to give Pearson’s formulae for the partial regression 
coefficients. The method of path coefficients here merely furnishes 
a convenient mnemonic rule for writing the normal equations. 

The correlation between the actual values of $V_0$ and the estimates ($V_e$) (Fig. 10) from each set of values of the other variables (given by the regression equation) is Pearson's coefficient of multiple correlation. Let $V_u$ stand for the array of residual factors of $V_0$, in a form independent of the known factors. We may write an equation of complete determination (Fig. 9)

[Fig. 10]

$\Sigma p_{0i}r_{0i} + p_{0u}r_{0u} = 1$, so that $\Sigma p_{0i}r_{0i} = 1 - p_{0u}^2$, since $r_{0u} = p_{0u}$, the sums running over the known factors $i = 1, \ldots, m$.

But $p_{0e}^2 = 1 - p_{0u}^2$ (Fig. 10). Therefore

(25) $R_{0(12\cdots m)} = r_{0e} = p_{0e} = \sqrt{\Sigma p_{0i}r_{0i}}$.

It is unnecessary to give illustrations of the use of the method in obtaining ordinary estimation or prediction equations.
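As a small numerical check of the normal equations (24) and of the multiple correlation as the square root of $\Sigma p_{0i}r_{0i}$, the sketch below solves for the path coefficients by ordinary Gaussian elimination. It is an illustration under stated assumptions: the helper names are hypothetical, the correlations are invented, and no safeguard is included for a singular correlation matrix.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def path_coefficients(r_factors, r_dependent):
    """Normal equations: r_0q = sum_i p_0i * r_iq for each known factor q."""
    return solve_linear(r_factors, r_dependent)


def multiple_correlation(p, r_dependent):
    """Square root of the sum of p_0i * r_0i over the known factors."""
    return math.sqrt(sum(pi * ri for pi, ri in zip(p, r_dependent)))


# Hypothetical correlations: two factors correlated 0.3 with each other.
r_factors = [[1.0, 0.3], [0.3, 1.0]]
r_dependent = [0.62, 0.55]

p = path_coefficients(r_factors, r_dependent)
big_r = multiple_correlation(p, r_dependent)
print(p, big_r)
```

Multiplying each path coefficient by the ratio of the standard deviation of the dependent variable to that of the factor would give the concrete partial regression coefficients.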
A somewhat different type of application has been made in 
estimating the transmitting capacity of dairy sires (Wright 1932a). 
In this case the necessary correlations were deduced from Men- 
delian theory checked by observed correlations between the sire’s 
female relatives and his daughters. These correlations were then 
used to calculate the multiple regression of sire on daughters and 
their dams. 
Partial Correlation 

It is sometimes of interest to find the values which statistics would take, on the average, in data selected for constancy of one or more variables.

(26) $\sigma_{0\cdot 12\cdots m}^2 = \sigma_{0(u)}^2 = \sigma_0^2p_{0u}^2 = \sigma_0^2\left(1 - R_{0(12\cdots m)}^2\right)$,

a well known formula. Inspection of equations (3) and (4) and of the definition of $\sigma_{0(1)}$ gives the following for the standard deviation of $V_0$ due directly to $V_1$, under constancy of $V_2$ etc., and for the related path coefficient and concrete partial regression coefficient under the same conditions.

(27) $\sigma_{0(1)\cdot 2\cdots m} = \sigma_{0(1)}\frac{\sigma_{1\cdot 2\cdots m}}{\sigma_1} = \sigma_{0(1)}\sqrt{1 - R_{1(2\cdots m)}^2}$

(28) $p_{01\cdot 2\cdots m} = \frac{\sigma_{0(1)\cdot 2\cdots m}}{\sigma_{0\cdot 2\cdots m}} = p_{01}\sqrt{\frac{1 - R_{1(2\cdots m)}^2}{1 - R_{0(2\cdots m)}^2}}$

(29) $c_{01\cdot 2\cdots m} = \frac{\sigma_{0(1)\cdot 2\cdots m}}{\sigma_{1\cdot 2\cdots m}} = c_{01}$

As might be expected, the concrete coefficients of the multiple regression equation ($c_{01}$, etc.) remain the same (on the average) in samples selected for constancy of one or more of the factors, while the abstract path coefficients are altered in value in such samples.
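The formula for partial correlation can be derived from the formula $\sigma_{0\cdot 1}^2 = \sigma_0^2(1 - r_{01}^2)$ as applied to the data in which particular variables ($V_2 \cdots V_m$) are constant.

(30) $\sigma_{0\cdot 12\cdots m}^2 = \sigma_{0\cdot 2\cdots m}^2\left(1 - r_{01\cdot 2\cdots m}^2\right)$,

$1 - r_{01\cdot 2\cdots m}^2 = \frac{\sigma_{0\cdot 12\cdots m}^2}{\sigma_{0\cdot 2\cdots m}^2} = \frac{1 - R_{0(12\cdots m)}^2}{1 - R_{0(2\cdots m)}^2}$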
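For three variables the ratio of residual variances in (30) can be compared with the familiar closed formula for a first-order partial correlation. The sketch below is illustrative: the function names and the correlations are hypothetical, and the sign is taken from the path coefficient $p_{01}$, an assumption of this sketch that anticipates the argument on sign which follows in the text.

```python
import math


def multiple_r_squared(r01, r02, r12):
    """Return (R^2 of variable 0 on variables 1 and 2, path coefficient p01)."""
    p01 = (r01 - r02 * r12) / (1 - r12 ** 2)
    p02 = (r02 - r01 * r12) / (1 - r12 ** 2)
    return p01 * r01 + p02 * r02, p01


def partial_correlation(r01, r02, r12):
    """r_01.2 from 1 - r^2 = (1 - R^2_0(12)) / (1 - R^2_0(2)), sign from p01."""
    big_r2, p01 = multiple_r_squared(r01, r02, r12)
    squared = 1 - (1 - big_r2) / (1 - r02 ** 2)   # R^2_0(2) is simply r02^2
    return math.copysign(math.sqrt(max(squared, 0.0)), p01)


def classical_partial(r01, r02, r12):
    """Textbook formula, used here only as a cross-check."""
    return (r01 - r02 * r12) / math.sqrt((1 - r02 ** 2) * (1 - r12 ** 2))


first = partial_correlation(0.62, 0.55, 0.30)
second = partial_correlation(-0.20, 0.50, 0.40)
print(first, classical_partial(0.62, 0.55, 0.30))
print(second, classical_partial(-0.20, 0.50, 0.40))
```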

This derivation leaves the sign uncertain but this is easily 
determined from a different approach. In Fig. 11, V., includes 


all factors of V, other than the desig- 

. ° Vy- 7-23 Vw 
nated independent variable V, and the 7 
variables V,:*: YU, which are to be held * 
constant. V, represents the residual V a 


° 


factor for V, , in relation to the factors 
V---. The value of P, in this sys- f 
tem is not of course the same as F, in Fic. 11 


the preceding discussion in which other variables than V,::: 
(those to be made constant) were treated as factors of V, . 
Since 
= ed -Z2 «+P 
2 ++ TH 


. Therefore 


(31) 


Thus has the same sign as 


ol-2 -*:™m oO; 


This is simply formula 28 except that it is in a set up in which all 
factors of V, except V, are held constant, in which case Toe —_— 
becomes 2%, 1 - 


. . a 
Since 1 Mle uy * e 
P 


i : 
““tia...m. * "sy 
2 2 
©C2 ++. mm) ef. #&, a OR Tiss 
letting V., represent the combination of \/, and V, the above 


i. il 
f— &, 


formulae for partial correlation can be written in a number of 
very compact forms. 
(32) 


(33) 


(34) 


The first of these is identical with 30. 





Symbolism 


The most widely current symbol for a partial regression coefficient is Yule's expression $b_{01\cdot 23\cdots m}$. Kelley (1923) uses a similar expression $\beta_{01\cdot 23\cdots m}$ for the coefficients in abstract form. These have an advantage over the symbols used here ($c_{01}$, $p_{01}$ respectively) in that they define certain absolute functions of the variables, while the latter symbols have meaning only in relation to a particular arrangement. This relativity of meaning can not, however, cause confusion as long as one is dealing with only a single system. If the problem is of a more complex sort than the calculation of a prediction formula, the $\beta$ symbolism becomes too cumbersome for convenience. The current symbolism has the
further disadvantage of a certain lack of logical consistency. In 
the expression 0,.,2 »4).25 and é.,. 23 the subscripts to the right 
of the dot are understood to represent factors held constant. In 
the expressions Z, ,, and /Z,,, this is not the case. If we wish 
to represent the multiple correlation of X, with X, and X,, 
independent of X,, or the beta (path coefficient) for the influ- 
ence of X, on X, in data involving also X, and X; but in 
which Xz _ is held constant, it would apparently be necessary 


under the usual symbolism to write such ambiguous expressions 





4,423.3 and [7 o/.23-3  Tespectively. Pearson’s method 
of writing constant factors as subscripts to the left of the main 
symbol avoids these difficulties and is the one which I have fol- 
lowed in earlier papers. The dot symbolism has, however, be- 
come so firmly established in the cases of the standard deviation 
and correlation coefficient that it is probably best to recognize it 
as the general device for indicating constant factors and to replace 
it in those symbols in which it is used for a different purpose. 
There is no difficulty in the case of multiple correlation. The expression $R_{0(12)}$ may be used for the correlation of $X_0$ with $X_1$ and $X_2$ jointly and the expression $R_{0(12)\cdot 3}$ is an unambiguous symbol for the multiple correlation independent of $X_3$.

As noted above, it is not desirable in the usual application of 
path coefficients to encumber the symbols with a list of the factors 
of which each dependent variable is treated as a function. This 
can be left to a diagram. Where a complete formal symbolism 
is desirable, the list of factors might follow a semicolon instead 
of a dot. Thus $p_{01;23\cdot 3}$ would unambiguously represent the path
coefficient relating $X_0$ to $X_1$ in a system in which $X_0$ is treated
as a function of $X_1$, $X_2$ and $X_3$ but in which $X_3$ is to be held
constant. There is, however, little need for such complicated 
expressions. 

Quantitative Evaluation of Causal Relations 

While the method of path coefficients is directly applicable to 
such problems as the estimation of correlation coefficients from 
knowledge of the mathematical relations between variables, or the 
converse (multiple regression) it was developed primarily as a 
means of combining the quantitative information given by a sys- 
tem of correlation coefficients with such information as may be at 
hand with regard to the causal relations, and thus of making 
quantitative an interpretation which would otherwise be merely 
qualitative. 

How far such causal analysis has meaning is a question on 
which there is difference of opinion. Some authors (Pearson, 
Niles) have contended that the designation of the relation be-
tween two variables as one of cause and effect involves a false 
conception; that we can merely observe more or less perfect 
correlation. This view seems to imply that direction in time is 
of no significance, and indeed G. N. Lewis has recently argued 
for the complete symmetry of the physicist’s time. The common 
sense view that direction in time is a basic perception is not with- 
out support, however. 

Under the theory of relativity, the elementary physical reality 
seems to be the point event located at a particular position in the 
space and time of a particular viewpoint. The objective world 
is to be thought of as a complex network of point events. Al- 
though two such events sufficiently remote from each other in 
space, relative to their separation in time, may have their order 
of succession in time reversed in the systems of two different
observers, order in time is invariant along any strand of this 
network involving continuity of physical action. Thus the succes- 
sion of collisions suffered by a particular body or by a beam of 
light is the same to all observers. Such successions of events as 
involved in the movement of a shadow over a surface may indeed 
be reversed by change of viewpoint, if the shadow happens to be 
moving more rapidly than the velocity of light, but the continuity 
of physical action here is not along the path of the shadow but 
traces separately to each point in this path from the points of 
interception of the light. There is frequently difficulty in com- 
plex cases in distinguishing lines of direct causation from correla- 
tions due to common causation but in principle the distinction is 
clear enough. Experimental intervention is possible only in the 
true lines of causation. 

In the world of large scale events, certain patterns tend to 
recur. Certain recurrent successions of events come to be recog- 
nized, experimentally or otherwise, as lines of causation in the 
above sense. Different lines of this character may come together 
in a certain type of event or may diverge from one. In many 
cases a fairly adequate representation of the course of nature can 














SEW ALL WRIGHT 177 


be obtained by viewing it as a coarse network in which the 
“events” of interest are the deviations in the values of certain 
measurable quantities. A qualitative scheme depends on observa- 
tion of sequences and experimental intervention. It is of interest 
to make such a scheme at least roughly quantitative in the sense 
of evaluating the relative importance of action along different 
paths. This was the primary purpose of the method of path 
coefficients. 
Birth Weight of Guinea Pigs 

The simplest application of this sort has been in connection 
with the factors which determine the weight of guinea pigs at 
birth (Wright 1921a). Minot (1891) noted that the average
birth weight is smaller, the greater the size of the litter. He
reasoned that this might be due either to a competition between
the developing foetuses, or merely to an effect of a large litter 
in stimulating somewhat premature birth. In confirmation of the 
latter hypothesis he found that the gestation period was several 
days shorter in large litters than in small ones and that there was 
in fact a direct relation between length of gestation period and 
birth weight. After some discussion, he concluded that the data 
afforded no evidence of growth competition and thus he decided 
in favor of the second hypothesis. I was able to confirm Minot’s 
observations, obtaining the following data in a large stock of 
guinea pigs. The mean birth weight (in grams) of the animals 
in the litter is the birth weight used. The interval between litters, 
where less than 75 days, is approximately the gestation period.
Standard errors are given. 


                        Mean             S.D.
B (Birth weight)      82.24 ± 0.51    18.60 ± 0.36     $r_{BI} = +.533 \pm .020$
I (Interval)          68.93 ± 0.05     1.91 ± 0.04     $r_{BL} = -.658 \pm .010$
L (Size of litter)     2.91 ± 0.04     1.29 ± 0.03     $r_{IL} = -.457 \pm .022$


The correlation between birth weight and size of litter was 
based on 3353 cases, the other two correlations on 1317 cases. 
In order to make a comparison of Minot's two alternatives, these may be represented graphically in a single diagram. Birth weight ($B$) is completely determined (in the mathematical sense rather than causally) by the prenatal growth curve and the age at which growth is interrupted by birth ($G$). It is assumed that the rate of growth ($R$) immediately before birth is a sufficient index of the growth function and that the rate of growth is uniform at this time to a sufficient degree of approximation. In substituting gestation period for interval a small correction is desirable. On grounds which need not be gone into here, it is estimated that the correlation between interval and true gestation period is about .95. No correction is necessary for birth weight since there is little or no growth in the first day after birth. The correlations involving interval must be divided by .95 to obtain estimates of those involving gestation period: $r_{BG} = +.56$, $r_{GL} = -.48$, while $r_{BL} = -.66$ is unchanged.

[Fig. 12]

Minot's problem resolves mathematically into the analysis of the observed correlation between birth weight and size of litter into the sum of two composite path coefficients representing the two postulated paths of influence.

$r_{BL} = p_{BRL} + p_{BGL}$

The method furnishes at once four equations for determining the values of the four path coefficients. One of these expresses the complete determination of $B$ by $R$ and $G$. The others are the expressions for the three known correlations.

(35) $p_{BR}^2 + p_{BG}^2 + 2p_{BR}\,p_{BG}\,p_{RL}\,p_{GL} = 1$
(36) $r_{BL} = p_{BR}\,p_{RL} + p_{BG}\,p_{GL} = -.66$
(37) $r_{BG} = p_{BG} + p_{BR}\,p_{RL}\,p_{GL} = +.56$
(38) $r_{GL} = p_{GL} = -.48$
These are not all linear equations, a condition which generally distinguishes this sort of application of the method from the calculation of partial regression coefficients. In the present case, however, there is no difficulty in the solution.

$p_{BR} = +.87$, $p_{BG} = +.30$, $p_{RL} = -.59$, $p_{GL} = -.48$,

$p_{BRL} = -.51$, $p_{BGL} = -.15$, $r_{BL} = -.66$.

The result is an analysis of the correlation between birth 
weight and size of litter into two components whose magnitudes 
indicate that size of litter has more than three times as much 
linear effect on birth weight through the mediation of its effect 
on growth as through its effect on the length of the gestation 
period, contrary to the results of Minot’s verbal analysis. 
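The solution can be retraced from the three corrected correlations quoted above ($r_{BG} = +.56$, $r_{GL} = -.48$, $r_{BL} = -.66$). The sketch below is an illustration, not the author's own computation. It assumes a diagram in which size of litter ($L$) acts on birth weight ($B$) only through rate of growth ($R$) and gestation period ($G$), and it takes the positive root for the path from $R$ to $B$. Two of the equations are linear in $p_{BG}$ and in the compound path through $R$, so only the equation of complete determination needs a square root. Because the inputs are rounded to two decimals, small differences from the individual coefficients quoted in the text are to be expected.

```python
import math

# Corrected correlations as quoted in the text.
r_BG, r_GL, r_BL = 0.56, -0.48, -0.66

# The path from L to G is the correlation itself.
p_GL = r_GL

# Writing w = p_BR * p_RL for the compound path B <- R <- L:
#   r_BL = w + p_BG * p_GL
#   r_BG = p_BG + w * p_GL
p_BG = (r_BG - r_BL * p_GL) / (1 - p_GL ** 2)
p_BRL = r_BL - p_BG * p_GL          # compound path through rate of growth
p_BGL = p_BG * p_GL                 # compound path through gestation period

# Complete determination of B by R and G (positive root assumed):
#   p_BR^2 + p_BG^2 + 2 * p_BG * w * p_GL = 1
p_BR = math.sqrt(1 - p_BG ** 2 - 2 * p_BG * p_BRL * p_GL)
p_RL = p_BRL / p_BR

print("p_BR", round(p_BR, 2), "p_BG", round(p_BG, 2), "p_RL", round(p_RL, 2))
print("through growth", round(p_BRL, 2), "through gestation", round(p_BGL, 2))
print("ratio", round(p_BRL / p_BGL, 2))
```

The two compound contributions add up to the observed correlation, and the contribution through rate of growth is more than three times that through gestation period, in line with the conclusion stated in the text.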

In this case, the answer to Minot’s question might have been 
obtained from a set up mathematically identical with that used 
in multiple regression (after correcting the correlations with inter- 
val to obtain estimates of those with true gestation period.) 

By equation 24,

(39) $r_{BG} = p_{BG} + p_{BL}\,r_{GL}$,

(40) $r_{BL} = p_{BG}\,r_{GL} + p_{BL}$.

[Fig. 13]

The term $p_{BL} = -.51$ can be interpreted as measuring the influence of size of litter on birth weight in all other ways than through gestation period. In other
cases, however, proper causal analysis may require a set up utterly 
different from that used in obtaining the best estimation equation. 
There is no routine method of making the proper diagram in the 
former case. This seems to have occasioned more misunderstand- 
ing than anything else among those who have attempted to apply 
the method. One author in a critique of the method, took the 
form of diagram intended to represent the sequential relations in 
the case of guinea pig weight and arranged some variables relat- 
ing to basal metabolism in man in the same scheme in an arbitrary 
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way and then complained of the meaningless and absurd results 
which he obtained! 
Transpiration of Plants 

The contrast between the kind of set up appropriate to an
estimation equation and that for evaluation of a causal
interpretation was illustrated early (1921a) in connection with a
study of the data of Briggs and Shantz on transpiration in plants.
The reader is referred to the paper for the details, but it may be
appropriate here to compare the different diagrams used. The
authors obtained the total daily transpiration of a number of
plants. The environmental factors studied were total solar
radiation (R), wind velocity (W), air temperature in the shade (T),
rate of evaporation from a shallow tank (E), and wet bulb
depression, sheltered from sun, but not wind (B). To avoid seasonal
effects, the logarithms of ratios for successive days were used
instead of absolute values.

An estimation equation for wet bulb depression was obtained
in terms of wind velocity, solar radiation and temperature (Fig.
14). It was pointed out that for causal analysis, radiation should
be omitted as not affecting wet bulb depression in the shade, while
a factor not directly measured, absolute humidity (H), should be
included. There should be complete determination of B by W, T and
H. As so arranged, there are two more unknown coefficients than
known ones. It was assumed that there was no correlation between
absolute humidity and wind velocity. The necessary additional
equation was obtained from the theoretical multiple regression
equation relating B to W, T and H, by substituting the extreme
differences in wet bulb depression, temperature and wind velocity
of the average daily cycle and assuming the absence of any such
cycle in absolute humidity. Possibly this was not wholly justified
in this case. If so, no numerical evaluation of the chosen point of
view could be made. Even in such cases, the attempt at analysis by
path coefficients may be valuable in locating deficiencies in the
data already collected and suggesting the kinds of new data which
should be obtained.

[Fig. 14]

The final set up used in relating transpiration and evaporation
from a tank to wet bulb depression and the chosen environmental
factors is given in figure 15 with the values of the path
coefficients and correlations. Determinations were made for 10
varieties of plants. These gave fairly consistent results which are
averaged in Fig. 15, although there were certain interesting
differences. There was a marked difference between the
transpiration of the plants and the rate of evaporation from the
tank in the relative importance of the various factors.

The Relative Importance of Heredity and Environment 

Among the most satisfactory applications to causal relations
are those to problems of genetic determination. The development of
an organism is the product of the confluence and interaction of
two distinct streams of causation, heredity and environment. The
interactions between the hereditary influences emanating from the
nuclei of the cells of the organism and the influences coming
from outside these cells, but largely from other parts of the body,
where they in turn are the products of heredity and cell
environment and so on back to the one cell stage, are complex
enough, but if we go back of this to the ultimate factors, the array
of genes assembled at fertilization and the environmental conditions
external to the organism, the sequential relations are for the most
part clear. The problem is that of determining the relative
importance of differences in heredity and of differences in
environment in determining differences in the characteristics of
individuals in a given population. The principal complications are
the possibilities of nonlinearity in the combination effects of
different genes, of different environmental factors, and of heredity
and environmental factors in relation to each other.

We will review a case, the amount of white in the coat pattern
of certain strains of guinea pigs, in which such combination
effects appear to have been of negligible importance (Wright
1920, 1926b). A stock of guinea pigs was maintained for many
years by the U. S. Bureau of Animal Industry without outcross,
but with the avoidance of even second cousin mating. The
correlation between mated individuals (143 pairs) was +.060 ± .083,
indicating that mating actually was at random in respect to coat
pattern. The correlation between parent and offspring averaged
+.191 ± .024³ with no significant differences in relation to sex.

By the theory developed on page 167, $r_{OP} = ab$. Allowing for
incomplete determination by heredity this becomes

(41) $r_{OP} = h^2 ab = \tfrac{1}{2}h^2$

Thus $h^2 = .38$, leaving .62 for determination by environment.
The correlation between litter mates averaged +.282 ± .027.
In the case of litter mates it is necessary to distinguish two
groups of environmental factors: ones common to litter mates (E)
and ones peculiar to individuals (D). From the diagram (Fig. 16),

(42) $r_{OO} = 2h^2a^2b^2 + e^2 = \tfrac{1}{2}h^2 + e^2$

where $e^2$ is the determination by common environment. Its value
is .09, leaving $d^2 = .53$ as the determination by nongenetic
factors not common to litter mates. It seems rather surprising that
the environment common to litter mates should determine so little
in a character graded at birth, but only very minor effects of this
sort have been discovered experimentally, the most important
(contributing .036 to the correlation of litter mates) being an
effect of the age of the mother. The high degree of asymmetry of
the pattern in individual animals is in harmony with a large element
of chance (somatic mutation?) in the determination of pigmented
areas.

³ These standard errors, obtained from values in different
subdivisions of the data, are larger than would be obtained from the
3881 parent-offspring pairs, which however necessarily involve much
repetition of individuals.
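
The partition of the variance can be retraced numerically. The sketch below assumes the two relations $r_{OP} = \tfrac{1}{2}h^2$ and $r_{OO} = \tfrac{1}{2}h^2 + e^2$, and takes the parent-offspring and litter-mate correlations as +.191 and +.282 (values as read here from the text). The function name is illustrative only.

```python
def variance_components(r_parent_offspring, r_litter_mates):
    """Split the variance into heredity (h2), common environment (e2)
    and factors peculiar to individuals (d2).

    Assumes random mating, no dominance, and complete determination,
    so that r_OP = h2 / 2 and r_OO = h2 / 2 + e2.
    """
    h2 = 2.0 * r_parent_offspring
    e2 = r_litter_mates - 0.5 * h2
    d2 = 1.0 - h2 - e2
    return h2, e2, d2


h2, e2, d2 = variance_components(0.191, 0.282)
print(f"h^2 = {h2:.2f}, e^2 = {e2:.2f}, d^2 = {d2:.2f}")
```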

The above estimates ($h^2 = .38$, $d^2 = .53$, $e^2 = .09$) are
estimates of the portion of the variance due to heredity,
non-genetic factors peculiar to individuals, and common environment,
respectively. They are the portion of the variance which should be
eliminated by control of each factor. It is not possible to control
the rather intangible environmental factors but hereditary variation
can be eliminated by close inbreeding (decrease of heterozygosis
being about 19% per generation under brother-sister mating). It
happened that a number of piebald stocks were on hand, each
descended from a single mating after several generations of
inbreeding. These differed markedly in average percentage of white
in the coat, although individuals of each varied widely about their
family averages. Crosses between strains at opposite extremes
gave intermediate offspring, justifying the assumption of no
dominance. The family (No. 35) most advanced in inbreeding
was descended from a single mating in the 12th generation of
brother-sister mating, but even in it there was variation from
nearly solid color to solid white. As expected by theory, very
little, if any, of this variability was hereditary. The correlation
between parent and offspring was only +.024 ± .020. The
correlation between litter mates was +.103 ± .025, again indicating
only a small amount of influence of environment common to litter
mates.
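
The figure of about 19% per generation can be illustrated with a short sketch. It assumes the standard recurrence for heterozygosis under continued brother-sister mating, $H_n = \tfrac{1}{2}H_{n-1} + \tfrac{1}{4}H_{n-2}$, which is not stated in this section, starting from a non-inbred stock ($H_0 = H_1 = 1$). The function name is hypothetical.

```python
def heterozygosis_under_sib_mating(generations):
    """Relative heterozygosis H_0 .. H_generations under continued
    brother-sister mating.

    Assumed recurrence (standard result, not quoted in the text):
        H_n = H_{n-1} / 2 + H_{n-2} / 4,  with H_0 = H_1 = 1.
    """
    h = [1.0, 1.0]
    for _ in range(generations - 1):
        h.append(0.5 * h[-1] + 0.25 * h[-2])
    return h


h = heterozygosis_under_sib_mating(12)
# Fractional loss of heterozygosis in the last generation computed.
per_generation_loss = 1.0 - h[-1] / h[-2]
print(f"heterozygosis remaining after 12 generations: {h[-1]:.3f}")
print(f"loss per generation: {per_generation_loss:.3f}")
```

The ratio of successive terms settles quickly, so the loss per generation is essentially constant after the first few generations.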

The standard deviation, measured on an appropriate scale,⁴
came out .574 (about 22% of the area of coat in the neighborhood
of 50%). In the random bred stock the standard deviation
was 0.782 (about 28%). The variance of the stock in which
hereditary variation had been eliminated was thus 54%
$\left(= \dfrac{.574^2}{.782^2}\right)$ of that of the random bred
stock. This agrees as well as could be expected with the estimate
of 62% of the variance of the latter as nongenetic, based on the
parent-offspring correlation, although not as well as an earlier
estimate made when the numbers were smaller (nongenetic variance
58% as deduced from a parent-offspring correlation of +.21 in
random stock, variance of inbred family 57% of that of random
stock).

⁴ On a percentage scale of measurement, necessarily limited at 0%
and 100%, a given factor has more effect near the middle of the
range than near the limits. The appropriate transformation of the
scale X, ranging from 0 to 1, is $X' = \mathrm{prf}^{-1}(X - .50)$
where $\mathrm{prf}^{-1}$ is the inverse probability function, the
direct function being defined in the form
$\mathrm{prf}\,x = \dfrac{1}{\sqrt{2\pi}} \int_0^x e^{-z^2/2}\,dz$
(Wright 1926a).
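
The scale transformation and the variance ratio can be checked with a small sketch. It assumes that the probability function of the footnote is the normal integral measured from zero, so that it equals the standard normal cumulative distribution minus one half. The helper names `prf` and `prf_inv` are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prf(x):
    """Normal integral from 0 to x (assumed form of the footnote's prf)."""
    return _STD_NORMAL.cdf(x) - 0.5


def prf_inv(p):
    """Inverse of prf; p must lie strictly between -0.5 and +0.5."""
    return _STD_NORMAL.inv_cdf(p + 0.5)


def to_transformed_scale(fraction_white):
    """Map a fraction of white (strictly between 0 and 1) to the new scale."""
    return prf_inv(fraction_white - 0.5)


sd_inbred, sd_random = 0.574, 0.782
# One standard deviation above 50% white, expressed as a fraction of coat area.
area_inbred = prf(sd_inbred)
area_random = prf(sd_random)
variance_ratio = (sd_inbred / sd_random) ** 2

print(f"inbred: {area_inbred:.2f}, random bred: {area_random:.2f}")
print(f"variance ratio: {variance_ratio:.2f}")
```

A standard deviation on the transformed scale converts back to a percentage of coat area near 50% through `prf`, which is how the figures of about 22% and 28% arise under this reading.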


Case of Human Intelligence 

Another illustration of the difference between a quantitative
interpretation and a multiple regression formula has been given
(Wright 1931a) using data of Miss B. S. Burks, on the roles of
heredity and environment in determining human intelligence.
These data consisted of intelligence tests of 104 California
children, tests of their parents and in addition grades of home
environment. Similar data were obtained of 206 children adopted
at an average age of 3 months, and of their foster parents and
home environments. The correlations as used were corrected by
Miss Burks for attenuation.

If the purpose is to obtain the best estimation for children in
terms of their parents and environments, the variables are to be
related as in figure 17 in which C is child’s intelligence, P is
midparent and E is the measure of home environment.

[Fig. 17]

Normal Equations                                   Children
                                                (Own)   (Adopted)
(43)  $p_{CP} + p_{CE}\,r_{EP} = r_{CP}$   =    +.61     +.23
(44)  $p_{CP}\,r_{EP} + p_{CE} = r_{CE}$   =    +.49     +.29
(45)  $r_{EP}$                             =    +.86     +.86
Solutions:  $p_{CP}$                       =    +.72     -.07
            $p_{CE}$                       =    -.13     +.35

The solutions of the normal equations in the two bodies of
data give what at first sight appear to be contradictory results. 
There is no apparent reason why environment should not play as 
great a role in shaping intelligence in one case as in the other, 
yet it turns out that while the partial regression of child’s IQ on 
home environment is significantly positive in the foster data, it is 
negative as far as it goes, in the case of own children. 
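
The contrast between the two groups can be reproduced by solving the two normal equations directly. The sketch below assumes the correlations as read from the text ($r_{CP} = +.61$, $r_{CE} = +.49$ for own children; $r_{CP} = +.23$, $r_{CE} = +.29$ for adopted children; $r_{EP} = +.86$ in both). The function name is illustrative.

```python
def solve_normal_equations(r_cp, r_ce, r_ep):
    """Solve  p_CP + p_CE * r_EP = r_CP
              p_CP * r_EP + p_CE = r_CE
    for the two partial regression (path) coefficients."""
    det = 1.0 - r_ep ** 2
    p_cp = (r_cp - r_ce * r_ep) / det
    p_ce = (r_ce - r_cp * r_ep) / det
    return p_cp, p_ce


# Correlations as read from the text (assumed values).
own = solve_normal_equations(r_cp=0.61, r_ce=0.49, r_ep=0.86)
adopted = solve_normal_equations(r_cp=0.23, r_ce=0.29, r_ep=0.86)

print(f"own children:     p_CP = {own[0]:+.2f}, p_CE = {own[1]:+.2f}")
print(f"adopted children: p_CP = {adopted[0]:+.2f}, p_CE = {adopted[1]:+.2f}")
```

Because $r_{EP}$ is high, the determinant $1 - r_{EP}^2$ is small and the coefficients are sensitive to small changes in the correlations, which is part of why they resist simple interpretation.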

The point that is sometimes overlooked is that the arrangement
for obtaining the best possible prediction equation does not
necessarily yield coefficients which have any simple interpretation.
This is obviously the case here. If child’s IQ is affected both by
heredity and environment, the same is presumably true of parental
IQ. In so far as the latter is determined by environment it
is not a causal factor in relation to child’s heredity. A diagram
intended to represent causal relation must represent parental IQ
as merely correlated (two headed arrows) with child’s heredity
and child’s environment. Another complication which must be
represented is the correlation of heredity with environment. Good
heredity in a family will tend to create a good environment and
vice versa. The simplest possible interpretative diagram for own
children is thus of the type of figure 18. That for foster children
is given in figure 19.


Even these are doubtless too simple since heredity is
represented as the only factor apart from the measured environment.
Any estimates of the importance of hereditary variation will thus
be maximum.

The two correlations given by Miss Burks in the case of the
foster data (Fig. 19)

($r_{CE} = +.29$, $r_{CP} = +.23$)

yield the value $r_{EP} = +.79$ for the correlation between home
environment and midparental IQ. The actual correlation was not
published for the foster data, but there is no reason why it
should differ significantly from that in the other data in which
the value was +.86. There is reasonable agreement.

[Fig. 19]

In the case of own children, three correlations of interest
here were published, $r_{CE} = +.49$, $r_{CP} = +.61$,
$r_{EP} = +.86$. But this is not enough to give a solution for the
values of the 5 indicated paths. The assumption of complete
determination of C by H and E gives a fourth equation, still an
inadequate number. No solution is possible, a situation which, as
previously noted, very frequently arises in such analysis, even when
one makes the most simplified possible qualitative representation of
the causal relations. A great deal of utterly unwarranted verbal
interpretation of correlation coefficients would be avoided if the
authors took the trouble to represent their ideas in diagrammatic
form and noted whether or not the number of equations possible
from the data (known correlation coefficients and known cases of
complete determination) was as great as the number of paths in
this diagram.

In the present case, another equation can be obtained by
borrowing from the foster data. Environment should make
approximately the same contribution to IQ in both groups of
children. The concrete partial regression coefficients

$c_{CH} \left(= p_{CH}\,\dfrac{\sigma_C}{\sigma_H}\right)$ and $c_{CE} \left(= p_{CE}\,\dfrac{\sigma_C}{\sigma_E}\right)$

should thus be approximately the same in the foster as in the own
children. Assuming that $\sigma_H$ and $\sigma_E$ are the same in
both cases, the ratio $p_{CE}/p_{CH}$ from the foster data may be
accepted for the group of own children. The five equations now
available are as follows:

Equations                                                      Solution
(46) $r_{EP} = +.86$
(47) $r_{CE} = p_{CE} + p_{CH}\,r_{HE} = +.49$                 $p_{CE} = +.27$
(48) $r_{CP} = p_{CE}\,r_{EP} + p_{CH}\,r_{HP} = +.61$         $p_{CH} = +.90$
(49) $p_{CE} = +.302\,p_{CH}$                                  $r_{HP} = +.42$
(50) $p_{CH}^2 + p_{CE}^2 + 2\,p_{CH}\,p_{CE}\,r_{HE} = 1$     $r_{HE} = +.24$

The solution assigns reasonable values in all cases and shows
that there was no real disagreement involved in the relation of
the two groups of children to their environments.
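
The solution can be retraced step by step. The sketch below is one possible way of solving the system, under stated assumptions: in the foster diagram heredity and environment are uncorrelated, so $p_{CE} = r_{CE}$ and $p_{CH} = \sqrt{1 - p_{CE}^2}$; the ratio $p_{CE}/p_{CH}$ is then carried over to the own children, whose correlations are taken as $r_{CE} = +.49$, $r_{CP} = +.61$, $r_{EP} = +.86$ (as read from the text). Function names are hypothetical.

```python
from math import sqrt


def foster_ratio(r_ce_foster):
    """Ratio p_CE / p_CH in the foster data, assuming heredity and
    environment are uncorrelated there and C is completely determined."""
    p_ce = r_ce_foster
    p_ch = sqrt(1.0 - p_ce ** 2)
    return p_ce / p_ch


def solve_own_children(r_ep, r_ce, r_cp, k):
    """Solve for the own-children diagram given p_CE = k * p_CH.

    From r_CE = p_CH * (k + r_HE) and complete determination
    p_CH^2 * (1 + k^2 + 2*k*r_HE) = 1, the unknown x = r_HE satisfies
        x^2 + 2*k*(1 - r_CE^2)*x + k^2 - r_CE^2*(1 + k^2) = 0.
    """
    b = 2.0 * k * (1.0 - r_ce ** 2)
    c = k ** 2 - r_ce ** 2 * (1.0 + k ** 2)
    r_he = (-b + sqrt(b ** 2 - 4.0 * c)) / 2.0  # positive root
    p_ch = r_ce / (k + r_he)
    p_ce = k * p_ch
    r_hp = (r_cp - p_ce * r_ep) / p_ch
    return p_ce, p_ch, r_he, r_hp


k = foster_ratio(0.29)
r_ep_foster = 0.23 / 0.29  # implied by r_CP = p_CE * r_EP in the foster data
p_ce, p_ch, r_he, r_hp = solve_own_children(0.86, 0.49, 0.61, k)

print(f"foster: ratio = {k:.3f}, implied r_EP = {r_ep_foster:.2f}")
print(f"own: p_CE = {p_ce:+.2f}, p_CH = {p_ch:+.2f}, "
      f"r_HE = {r_he:+.2f}, r_HP = {r_hp:+.2f}")
```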

It was noted that this analysis gives a maximum estimation 
of the role of heredity. An attempt was made to obtain a min- 
imum estimate compatible with acceptance of the observed cor- 
relations, by carrying the analysis back a generation and assuming 
as much similarity in the determining factors of successive gen- 
erations as the data permit. 

Such analysis requires separate treatment of heredity (H) as
a factor of development, and heredity or genotype (G) as the
linear system of gene effects which best approximates the former.
Departures from linearity in the effects of allelomorphs
(dominance) and in the effects of nonallelomorphs (epistasis) are
common. Moreover there may be non-linearity in the combination
effects of heredity and environment. Thus a certain genetic
complex in the guinea pig (c*c*BB) produces more melanin pigment
at low temperatures than does a certain other (CC&€) but
less at high temperatures (Wright 1927). The subject is too
involved for detailed discussion here but it may be noted that in
general correlations between deviations due to dominance and
epistasis must be taken account of.

In the case of Miss Burks’s data, there is no possible way of
distinguishing the effects of environmental factors not included
in the measurement of home environment from the contributions
of dominance and epistasis or from non-linearity in the
combination effects of heredity and environment. In the attempt at
obtaining a minimum estimate of heredity, these three very diverse
factors were put together in a miscellaneous group M. The diagram
of relation used is given in Fig. 20. Child’s genotype (G) is
represented as partially determined by midparental genotype (G'),
the residual variability being that of Mendelian segregation.
Child’s environment is treated as in part determined directly by
midparental intelligence P and in part tracing to the environment
of the preceding generation, E'.

[Fig. 20]

The path coefficient relating genotype of midparent to that of
child could be estimated, assuming Mendelian heredity and taking
into account a correlation of +.70 between father and mother.
It turned out to be mathematically impossible to assign the same
values to the path coefficients of the parental generation as in the
offspring generation, but this is not surprising since the parents
were tested as adults instead of young children. The solution for
the parent generation was to some extent indeterminate but within
rather narrow limits, on making what seemed the most reasonable
assumptions. The values reached are given in figure 20. The
path coefficient for influence of hereditary variation lies between
the limits +.71 (if dominance and epistasis are lacking) and +.90.


Analysis of Size Factors 

The first published application of the method was to the in- 
terpretation of a system of correlations of bone measurements 
(length and breadth of skull, lengths of humerus, femur and 
tibia) in a population of rabbits (Wright 1918). The 10 ob- 
served correlations were accounted for primarily as due to a 
single general factor (not necessarily acting proportionately on 
the 5 variates). The residuals which appeared were attributed to 
group factors. 

In a recent paper (1932b) the same figures, two other sets
of figures for rabbit populations ($F_1$ and $F_2$ of a wide cross)
and figures from a flock of hens have been analyzed by a somewhat
improved method. A set of n variables yields $\tfrac{1}{2}n(n-1)$
correlation coefficients and hence the same number of observation
equations of the type $r_{AB} = p_{AG}\,p_{BG}$, where A and B are
two of the variables and G is the general factor and it is assumed
for the moment that the correlations are due solely to differences
in general size. The residuals are minimized by the method of
least squares.
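
A minimal sketch of such a fit follows. It is not the procedure of the 1932b paper: it uses a simple alternating least-squares update, and the five path coefficients used to build the correlation table are hypothetical illustrative values, not the rabbit or fowl data.

```python
def fit_general_factor(r, iterations=200):
    """Least-squares fit of r[i][j] ~ p[i] * p[j] for all pairs i != j.

    Each pass updates one coefficient at a time to the value that
    minimizes the sum of squared residuals with the others held fixed.
    """
    n = len(r)
    p = [0.5] * n
    for _ in range(iterations):
        for i in range(n):
            num = sum(r[i][j] * p[j] for j in range(n) if j != i)
            den = sum(p[j] ** 2 for j in range(n) if j != i)
            p[i] = num / den
    return p


# Hypothetical path coefficients from the general factor to 5 measurements.
true_p = [0.9, 0.8, 0.7, 0.6, 0.5]
n = len(true_p)
# Correlations implied by a single general factor (diagonal is unused).
r = [[true_p[i] * true_p[j] if i != j else 1.0 for j in range(n)]
     for i in range(n)]

n_equations = n * (n - 1) // 2
fitted = fit_general_factor(r)
residuals = [r[i][j] - fitted[i] * fitted[j]
             for i in range(n) for j in range(i + 1, n)]

print(f"{n_equations} observation equations for {n} unknowns")
print("fitted path coefficients:", [round(x, 3) for x in fitted])
print("largest residual:", max(abs(x) for x in residuals))
```

With real data the residuals would not vanish, and their pattern is what suggests the group factors discussed in the text.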
This method necessarily gives residuals which are as likely to
be negative as positive. The interpretation is more satisfactory
if the path coefficients relating each measurement to the general
factor are all reduced by the proportion necessary to eliminate
significant negative residuals. It happened that in each of the
4 sets of data studied, the most important negative residuals were
those between the skull and hind leg measurements, and the
method followed was to eliminate the average of these.

The important positive residuals in all cases indicated natural
group factors: a head group, a general leg group, a foreleg
group (in the one case in which both humerus and ulna were
measured) and a hind leg group. Other indications such as a
slightly closer relation between head and foreleg than between
head and hind leg, slightly closer relation between proximal leg
bones (humerus and femur) than between non-homologous fore
and hind leg bones (humerus and tibia) were less certain. Figure
21 shows the system of path coefficients arrived at in the case
of the fowl measurements. The squares of these give the degree
of determination in each case by the general factor, the group
factors and special factors.

[Fig. 21]

The Use of Partial Correlation in Interpretation


Partial correlation coefficients have sometimes been used in
the attempt to interpret systems of correlated variables apparently
on the theory that the reduction or elimination of a correlation
between two variables on holding a third constant demonstrates
the latter to be causally responsible for the correlation. The
method at first sight seems analogous to that of the
experimentalist in attempting to control all sources of variation
except those in which he is interested. This, however, is a
delusion in the case of correlation (as opposed to regression)
coefficients (Wright 1921a) and the method of path coefficients was
developed because of the unsatisfactory nature of interpretation
based on partial correlation. As R. A. Fisher (1925) has stated,
“In no case, however, can we judge whether or not it is profitable
to eliminate a certain variable unless we know or are willing to
assume a qualitative scheme of causation.”

This point can be illustrated by considering a system of 3
variables, A, B and C in which the following correlations have
been found.

$r_{AB} = +.50$, $r_{BC} = +.50$, $r_{AC} = +.25$.

By substitution in the usual formula, $r_{AC \cdot B} = 0$. This is
compatible with the interpretation, represented in figure 22, that
B is an intermediary in a single chain of causation connecting
C and A.

$r_{AC} = p_{AB}\,p_{BC} = +.25$

[Fig. 22]

Another interpretation is that B is the only common factor.

$r_{AC} = p_{AB}\,p_{CB} = +.25$

But it is also possible that B may be the product of the
interaction of two correlated factors A and C, in which case
$r_{AB} = p_{BA} + p_{BC}\,r_{AC}$ and $r_{BC} = p_{BC} + p_{BA}\,r_{AC}$, giving

$p_{BA} = p_{BC} = +.40$.

[Fig. 24]

Finally A, B and C may be correlated with each other through
reciprocal interactions or through complexes of unknown common
factors, making impossible anything beyond the mere descriptive use
of the correlation coefficients, or the calculation of estimation
equations.
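
The point that one set of correlations fits several causal schemes can be made concrete. The sketch below assumes the three correlations $r_{AB} = +.50$, $r_{BC} = +.50$, $r_{AC} = +.25$ (as read here from the text) and uses the usual first-order partial correlation formula. Variable names are illustrative.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Correlations assumed for the illustration.
r_AB, r_BC, r_AC = 0.50, 0.50, 0.25

# Holding B constant removes the correlation between A and C entirely.
r_AC_given_B = partial_correlation(r_AC, r_AB, r_BC)

# Schemes 1 and 2 (chain C -> B -> A, or B as sole common factor):
# the path coefficients equal the correlations, and r_AC is their product.
r_AC_from_chain = r_AB * r_BC

# Scheme 3 (B determined by correlated factors A and C):
# solve r_AB = p_BA + p_BC * r_AC and r_BC = p_BC + p_BA * r_AC.
p_BA = (r_AB - r_BC * r_AC) / (1 - r_AC ** 2)
p_BC = (r_BC - r_AB * r_AC) / (1 - r_AC ** 2)

print(f"r_AC.B = {r_AC_given_B:.2f}")
print(f"chain or common factor: r_AC = {r_AC_from_chain:.2f}")
print(f"B as joint effect: p_BA = {p_BA:.2f}, p_BC = {p_BC:.2f}")
```

All three schemes reproduce the same correlations exactly, so the vanishing partial correlation cannot by itself decide among them.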
The first step in the application of the method of path
coefficients is to bring clearly into the open the system
of functional relations among the variables which
seems significant for purposes of interpretation. In
the majority of cases, verbal interpretations which
seem reasonable enough as long as the basic postulates
are kept discreetly in the subconscious mind become obviously
crude and inadequate when expressed in a diagram. Occasionally,
however, statistical systems are capable of some interpretation.

[Fig. 25]


Difficulties in Causal Analysis 

There are a great many systems of correlated variables for 
which no interpretation can be suggested in terms of sequential 
relations. Among these are cases in which there is prevailingly 
mutual interaction between the variables instead of action in one 
direction. The branches of science differ considerably in the type 
of relation which predominates. 

As already noted the developmental process of organisms is 
essentially a one way process, and the ultimate factors of devel- 
opment, heredity and environment act on it without being acted 
upon. A method of analysis which takes account of the sequential 
relations is thus imperatively called for in genetics. 

A case in which such analysis would not be possible may be 
illustrated by the relations among the various properties of the 
blood, as discussed by L. J. Henderson. The physiological mech- 
anisms are such that alteration of any one brings about immediate 
readjustments in the values of the others. What one wishes to 
determine are the functional relations, whether in the form of 
equations or of nomograms. If such a system were studied by 
correlational methods the best that could be done would be to 
attempt to approximate the functional relations by multiple re- 
gression (linear or curvilinear as the case required). 

There is usually rapid reciprocal action among the variables
of interest to the economist or sociologist and the correlations
among the simultaneous deviations cannot, in most cases, be
treated as due to lines of one way causation among these
variables themselves. Thus the price of a commodity cannot properly
be treated as caused by the amount marketed or vice versa. The
exception is where one variable is clearly external to the social
system in question as is the influence of weather on crop yield.

There is more likelihood of being able to represent the various 
simultaneous deviations as direct consequences of the system of 
deviations of the preceding year (together with the clearly exter- 
nal contemporary factors) but even here, a causal diagram can 
be set up only after a most careful consideration of the realities 
of the case. There may be lags of greater duration than one 
year and a correlation between two variables in successive years 
may trace to more remote common factors rather than to a direct 
line of causation from the earlier to the later. 


Corn and Hog Correlations 

These points were illustrated by a study of corn and hog cor- 
relations (Wright 1924). An attempt was made to analyze the 
play of interacting factors responsible for the annual fluctuations 
from the general trends in production and price of hogs during 
the relatively undisturbed period between the Civil War and the 
World War. It was shown that variation in the corn crop and 
certain interrelations among the hog variables themselves deter- 
mined from 75 to 85% of the variance of the latter. The annual 
fluctuations about the trend during the period of years from 1871 
to 1915 inclusive (so far as data were available) were found for 
corn acreage, yield, crop and price and for western and eastern 
wholesale hog packs and for farm price of hogs. The fluctuations 
were found separately for the summer and winter seasons for 
western wholesale hog pack and the corresponding live weight, 
pork production (product of preceding) and price. Correlation 
coefficients were found not only for the same year but between 
variables separated by one, two and often three years. Altogether 
510 correlation coefficients were calculated. 
Most of these coefficients could be given reasonable enough
verbal interpretations, but there was no assurance that the
“obvious” interpretation in one case was compatible with an equally
“obvious” interpretation in another. The problem was to
represent all of these verbal interpretations in a single diagram
and determine path coefficients which would account simultaneously
for the entire system of correlation coefficients. With 510
correlation coefficients and 4 cases of complete determination, one
could write 514 simultaneous equations to determine the values
of whatever system of path coefficients had been used.
Theoretically one could introduce the same number of different paths
into the diagram. It would not be practicable, however, to deal
with such a large number of unknown quantities and even if
practicable, the complexity of the system would defeat the purpose
of the analysis. The problem thus resolved into the discovery of
a simple system of relations which would give a reasonably close
approximation to all of the correlation coefficients.

It has been emphasized that the method of path coefficients 
is not intended to accomplish the impossible task of deducing 
causal relations from the values of the correlation coefficients. It 
is intended to combine the quantitative information given by the 
correlations with such qualitative information as may be at
hand on causal relations to give a quantitative interpretation. The
analysis of cases such as the present and that preceding (size 
factors), in which the equations far outnumber the coefficients 
to be determined, may appear to be exceptions to this statement, 
but even here only such paths are tried which are appropriate in 
direction in time and which can be given a rational interpretation. 

Considerable experimentation was necessary before a simple
system could be found which gave even moderately satisfactory
results. The procedure followed was to list the highest five
correlations of each variable with a preceding variable. It turned
out that the corn variables were so nearly independent of
conditions in preceding years that they might be treated practically
as independent in relation to the hog situation. The variations in
corn crop depended largely on variations in yield (Fe. = +:40) and 
secondarily on variations in acreage (R,=+.45) . Corn price 
showed a correlation of <0 with the crop. 

Among the hog variables, the maximum correlations were with
those which indicated most directly the amount of breeding
(average summer weight (SW) of the same year, winter pack (WP)
a year and a half later, between which there was a correlation of
+.78), and with the preceding prices of corn and of hogs. The
four variables: breeding (B), summer (S) and winter (W) price
of hogs and price of corn (P) were thus chosen as a central
system. 36 equations could be written involving these (using jointly
the two indicators of breeding). Values of 13 path coefficients
were tested by repeated trial and error until it seemed that no
change (of the order of .05) would give improvement. The
system reached is shown in figure 26 in which primes refer to
preceding years.

The other variables were then appended to this system, also
by the trial and error method. Corn crop was used in place of
corn price, however. The results are shown in figures 27 and 28.
These bring out the very different characteristics of the summer
pack (SP) (consisting of a very heterogeneous lot of hogs) and the
winter pack (WP), largely consisting of the spring pig crop.
Average summer and winter live weights are represented by (SW) and
(WW) in figure 27.

The general conclusions were that the dominating features of 
the hog situation are the corn crop and its price, and an innate 
tendency to fall into a cycle of successive overproduction and 
underproduction, two years from one extreme to the other, de- 
pending mainly on two compound paths: 

owe’ = —42 and laa seated 
The 32 indicated path coefficients together with 10 others relating
total annual western pack to its components, eastern pack to
western pack and prices, and farm price to packer’s price,
accounted for the 510 observed correlation coefficients with an
average error of only .09 neglecting sign. The most serious
discrepancies were in certain correlations involving corn acreage
and yield which were intentionally ignored for the sake of avoiding
complexity in the relations of the more important variables.


[Fig. 28]


The Elasticities of Supply and Demand 
In the preceding illustration, market supplies, prices, etc. were 
related to preceding conditions largely by a trial and error process 
of finding the system which would work best and without much
regard for theoretical considerations. In the following more
theoretical approach I have collaborated with Dr. P. G. Wright. 
The purpose is to interpret observed series of prices and quanti- 
ties marketed as functions of two hypothetical variables, the con- 
ditions of supply and demand. Only a brief reference has previ- 
ously been published (P. G. Wright 1928). 

The demand for a given commodity and given market is 
treated as that function of all economic factors (prices, wages, 
etc.) which determines the quantity which would be purchased 
under any set of postulated conditions. The supply function, 
similarly, is treated as that function of all economic factors 
(prices, manufacturing costs, weather, etc.) which determines the 
quantity which would be offered for sale under any set of postu- 
lated conditions. The actual values which these functions take at 
a given moment tend to be the same, the price of the commodity 
itself being the immediate factor which shifts to such a value as 
to make them identical.

We shall deal with the annual percentage deviations in quan- 
tities and prices, whether from the preceding year or from the 
estimated trend of a series of years, instead of absolute values. 
The relative merits of these two procedures need not be gone into. 

Let $X$ represent values on a scale of percentage change in
quantity and $Y$ values on a scale of percentage change in the price
of the commodity in question. Let $Z_1$, $Z_2$, etc. represent other
economic factors of demand or supply or both on whatever scales
are most suitable. The demand and supply functions themselves
as percentage deviations in quantities under postulated conditions
may be represented by $X_d$ and $X_s$ respectively.

(51)  $X_d = f_d(Y, Z_1, Z_2, Z_3, \ldots)$

(52)  $X_s = f_s(Y, Z_1, Z_2, Z_3, \ldots)$


5 If the absolute quantities are represented by $U$ and the absolute
prices by $V$, $X = \frac{dU}{U}$ and $Y = \frac{dV}{V}$. It is customary to define the
demand and supply functions in terms of the absolute values, but for the
present purpose it is more convenient to define them in relation to the
percentage deviation.

Assume that these functions are of such a nature that the
deviations in price can be separated linearly from the other
factors to a sufficient degree of approximation. This does not imply
lack of correlation between price and the others.

(53)  $X_d = \eta Y + D$,  where $D$ is a function of $Z_1, Z_2, Z_3, \ldots$

(54)  $X_s = e Y + S$,  where $S$ is a function of $Z_1, Z_2, Z_3, \ldots$

The demand function is here analyzed into two variable components,
a multiple of the price deviation ($\eta Y$) and the deviation ($D$) in
the quantity which would be purchased if there were no price
deviation ($Y = 0$). The supply function is similarly analyzed into a
different multiple of the price deviation ($eY$) and the deviation
($S$) in the quantity which would be offered for sale in the absence
of a price deviation. Thus $D$ and $S$ measure the strength of demand
and supply apart from price and will be spoken of as measures of
demand and supply.

For given values of $D$ and $S$ the equations define two straight
lines which describe the momentary demand and supply situations
respectively (Fig. 29). Their slopes relative to the $Y$-axis are given
by $\eta$ and $e$ respectively. These slopes are in accordance with the
customary definitions of the elasticities of demand and supply,
recalling that $X$ and $Y$ are percentage deviations.⁶
According to the usual theory, the actual quantity which
changes hands and the actual price are determined by the point
of intersection of the supply and demand curves. Under the
approximations previously assumed, and assuming constancy of the
elasticities, but variation of $D$ and $S$, the percentage deviations
in quantity ($Q$) and in price ($P$) are linear functions of $D$ and $S$.
Their values may be represented as determined by multiple
regression equations. It will be convenient to use single letters for
the path coefficients.

6 The ratio $\frac{X}{Y} = \frac{dU/U}{dV/V}$, where $U$ and $V$ are absolute quantities and
prices respectively. The ratio $X_s/Y$ is the elasticity of supply ($e$) if $S = 0$,
and $X_d/Y$ is the elasticity of demand ($\eta$) if $D = 0$.


(55)  $P = p_1 \frac{\sigma_P}{\sigma_D} D + p_2 \frac{\sigma_P}{\sigma_S} S$

(56)  $Q = q_1 \frac{\sigma_Q}{\sigma_D} D + q_2 \frac{\sigma_Q}{\sigma_S} S$


The elasticity of supply may be obtained from the ratio of $Q$
to $P$ under a fixed average supply situation ($S = 0$) but variable
demand.

(57)  $e = \frac{q_1 \sigma_Q}{p_1 \sigma_P}$

Similarly, elasticity of demand is given by the ratio of $Q$ to
$P$ when $D$ equals zero.

(58)  $\eta = \frac{q_2 \sigma_Q}{p_2 \sigma_P}$

Since the standard deviations are obtainable directly from the 
data it is merely necessary to find the values of the path coeffi- 
cients in order to calculate the two elasticities. 

A diagram can be set up as in Fig. 30 indicating primarily that
$P$ and $Q$ are different linear functions of $D$ and $S$. Three
equations can be written at once; two indicating complete
determination of $P$ and $Q$ by $D$ and $S$, and one representing the
correlation between $P$ and $Q$.

(59)  $p_1^2 + p_2^2 + 2 p_1 p_2 r_{SD} = 1$

(60)  $q_1^2 + q_2^2 + 2 q_1 q_2 r_{SD} = 1$

(61)  $p_1 q_1 + p_2 q_2 + (p_1 q_2 + p_2 q_1) r_{SD} = r_{PQ}$

Unfortunately these three equations involve 5 unknowns. Other data
must be brought to bear on the problem before any solution is possible.

The diagram suggests two possible sources of additional data.
If any measurable quantity ($A$) can be found which is correlated
with the demand situation but which can safely be assumed to be
independent of the supply situation ($r_{AS} = 0$), we can write two new
equations representing the correlations $r_{AP}$ and $r_{AQ}$ respectively
at the expense of only one additional unknown ($r_{AD} = d$). We have
now 5 equations and 6 unknowns. If it can safely be assumed that
there is no correlation between the demand and supply situations
($r_{SD} = 0$), a solution is possible. If such an assumption with regard
to $r_{SD}$ does not seem justified, it may be possible to find a
quantity ($B$) correlated with the supply situation (as measured by $S$)
but of such a nature that no correlation with the demand situation
need be postulated. The correlations $r_{BP}$ and $r_{BQ}$ make possible two
more equations, with only one more unknown ($r_{BS} = s$), bringing
the number of equations and unknowns both up to 7. The path
coefficients and hence the elasticities are now determinate. The
additional equations are as follows:

(62)  $r_{AP} = p_1 d$        (64)  $r_{BP} = p_2 s$

(63)  $r_{AQ} = q_1 d$        (65)  $r_{BQ} = q_2 s$


The hog and corn data referred to in the preceding section
were not obtained with the present purpose in mind, but may
furnish rough illustrations of the method. The total weight of
hogs marketed at the principal markets in the summer season
(March to October) 1889-1914, and the reported price may be
considered first. Absolute instead of percentage deviations from
trend were used but the correlations should not be affected much
and coefficients of variation may be used in place of the standard
deviations on a percentage scale. The most important single
factor affecting the summer hog pack was shown to be the corn crop
of the preceding year. It is assumed that it is a factor of type $B$,
correlated with the supply situation as measured by $S$ but not
with the demand for pork as measured by $D$. It is further
assumed that there was no correlation between the supply and
demand situations.


Data

Coefficient of variation—Price                        $\sigma_P = 15.86$
Coefficient of variation—Quantity                     $\sigma_Q = 10.99$
Correlation—Price with quantity                       $r_{PQ} = -.63$
Correlation—Hog price with preceding corn crop        $r_{BP} = -.47$
Correlation—Weight of pack with preceding corn crop   $r_{BQ} = +.64$

Equations                       Solution

$p_1^2 + p_2^2 = 1$             $p_1 = +.686$     $e = +.133$
$q_1^2 + q_2^2 = 1$             $p_2 = -.728$     $\eta = -.944$
$p_1 q_1 + p_2 q_2 = -.63$      $q_1 = +.132$
$p_2 s = -.47$                  $q_2 = +.991$
$q_2 s = +.64$                  $s = +.646$


The solution indicates very little elasticity of supply ($e = +.133$)
but a very considerable elasticity of demand ($\eta = -.944$).

Similar data were given for the winter weight of pack (1870-
1914). The largest correlation with a factor of preceding years 
was with average summer live weight of hogs, one and one half 
years before. This factor (an index of amount of breeding) is 
again assumed to be related to the supply but not to the demand 
situation, and again it is assumed that the supply and demand 
situations vary independently of each other. 


Data                            Solution

$\sigma_P$ = [illegible]        $p_1 = +.656$     $e = +.110$
$\sigma_Q = 8.75$               $p_2 = -.755$     $\eta = -.884$
$r_{PQ} = -.68$                 $q_1 = +.108$
$r_{BP} = -.65$                 $q_2 = +.994$
$r_{BQ} = +.83$                 $s = +.825$
The results are remarkably close to those of the quantity and
price of summer pork. The low elasticities of supply are to be 
expected of an agricultural commodity the quantity of which is 
largely determined in advance and by factors independent of the 
market demand and which once produced must largely be 
marketed. 

I am indebted to my colleague, Professor Henry Schultz, for 
data on the quantity and price of potatoes marketed annually 
from 1896 to 1914 and the suggestion that it would be interesting 
material for analysis by this method. Trends had been fitted by 
Professor Schultz and trend ratios of quantity and price obtained. 

Data

Standard deviation of price ratios            .185
Standard deviation of quantity ratios         .130

Correlations (primes denote the preceding year)
Price—quantity (same year)                    $r_{PQ} = -.852$
Price—quantity (preceding year)               $r_{PQ'} = +.570$
Price—price (preceding year)                  $r_{PP'} = -.562$
Quantity—quantity (preceding year)            $r_{QQ'} = -.522$
Quantity—price (preceding year)               $r_{QP'} = +.651$

It is assumed again that there is no correlation between supply
and demand situations ($r_{SD} = 0$) and that the price (as a trend
ratio) is a factor of type $B$ affecting the supply of the following
year but without influence on the demand of the following year.
The solution is as follows:

$p_1 = +.503$      $q_1 = +.024$       $e = +.034$

$p_2 = -.864$      $q_2 = +.9997$      $\eta = -.815$

$s = +.651$
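
As a rough numerical check on the potato solution, the five equations can be solved in closed form when $r_{SD} = 0$ and a single factor of type $B$ is available. The sketch below is an illustration only and not part of the original analysis: the function name `solve_type_b`, the sign conventions ($s > 0$, $p_1 > 0$) and the reading of the standard deviations as .185 and .130 are assumptions.

```python
import math

def solve_type_b(r_pq, r_bp, r_bq, sd_p, sd_q):
    """Solve p1^2+p2^2=1, q1^2+q2^2=1, p1*q1+p2*q2=r_PQ,
    p2*s=r_BP, q2*s=r_BQ (assumes r_SD = 0)."""
    # Eliminating p1 and q1 gives s^2 directly: it equals the squared
    # multiple correlation of B with P and Q.
    s2 = (r_bp**2 + r_bq**2 - 2.0 * r_pq * r_bp * r_bq) / (1.0 - r_pq**2)
    s = math.sqrt(s2)              # assumed convention: s > 0
    p2 = r_bp / s
    q2 = r_bq / s
    p1 = math.sqrt(1.0 - p2**2)    # assumed convention: p1 > 0
    q1 = (r_pq - p2 * q2) / p1     # from the equation for r_PQ
    e = q1 * sd_q / (p1 * sd_p)    # elasticity of supply
    eta = q2 * sd_q / (p2 * sd_p)  # elasticity of demand
    return {"p1": p1, "p2": p2, "q1": q1, "q2": q2, "s": s, "e": e, "eta": eta}

# Potato data as read above (hypothetical variable name).
potatoes = solve_type_b(r_pq=-0.852, r_bp=-0.562, r_bq=0.651,
                        sd_p=0.185, sd_q=0.130)
for name, value in potatoes.items():
    print(f"{name:>3} = {value:+.4f}")
```

With these inputs the values come out close to the published solution. The small quantities $q_1$ and $e$ are very sensitive to rounding in the correlations, so agreement for them is only approximate.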


Figure 31 gives a graphical representation of the relation. 
Again the virtual absence of elasticity of supply might per- 
haps have been anticipated. The size of crop is largely determined 
before the price is known and the crop must be disposed of
regardless of price. It is to be noted, however, that this result came
out quite independently of any such assumption.⁷ There are other
checks on the theory. Two of the correlations reported above
have not been used. According to the diagram of relations
(primes denoting values of the preceding year)
$r_{QQ'} = q_2 s\, r_{P'Q'} = -.554$. The observed value, -.522,
is in good agreement. Also $r_{PQ'} = p_2 s\, r_{P'Q'} = +.479$.
The agreement with the observed
value of +.570 is not as good as in the previous case, but
considering the small number of years, is not bad.
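
The two checks can be reproduced by multiplying along the compound paths. This is a minimal sketch using the solution values as read from the text ($q_2 = .9997$, $p_2 = -.864$, $s = .651$) and taking $r_{P'Q'}$ equal to the same-year correlation $-.852$; the variable names are illustrative.

```python
# Path coefficients and correlation as read from the potato solution.
q2, p2, s = 0.9997, -0.864, 0.651
r_pq_same_year = -0.852  # used for r_{P'Q'}, the preceding year's P-Q correlation

# Quantity with preceding quantity: Q <- S <- P' <-> Q'
r_qq_prev = q2 * s * r_pq_same_year
# Price with preceding quantity:   P <- S <- P' <-> Q'
r_pq_prev = p2 * s * r_pq_same_year

print(round(r_qq_prev, 3), round(r_pq_prev, 3))  # prints -0.554 0.479
```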

The absence of elasticity of supply in the case of potatoes
applies only within a single year. The fact that the supply is
strongly correlated with the price of the preceding year (+.651)
indicates that in the long run there is considerable elasticity. The
method of path coefficients readily lends itself to deduction of
this long time elasticity.

Let $\bar{P}$, $\bar{Q}$, $\bar{A}$ and $\bar{B}$ be the hypothetical averages of $P$, $Q$, $A$
and $B$ respectively over an indefinite ($n$) period of years. The
problem is to deduce the elasticities toward which the long time
supply and demand curves tend, from knowledge merely of the
correlations from year to year. The following equation can be




written from figure 32, 
where a, t, < and vm. = 


g are path coefficients pertaining to the paths indicated. 


7 In two other cases studied by this method (P. G. Wright 1928) very
different results were obtained. In the case of butter, the elasticity of supply
came out 1.43, of demand -.62. In the case of flax seed, the elasticity of
supply came out even greater, 2.39, while that of demand was -.80. But
these are cases in which a high elasticity of supply is to be expected on a
priori grounds. It is interesting to note that in cases in which it seems
justifiable to assume a priori that there is no elasticity of supply ($e = 0$), it
follows that $q_1 = 0$, $q_2 = 1$, $p_2 = r_{PQ}$ (still assuming $r_{SD} = 0$) and finally
that $\eta = \frac{\sigma_Q}{r_{PQ}\,\sigma_P}$.
(66) Apq * 72¢2%,, = REGAL AtH A; )
(67) ae = 7¢ fr og = 7t€ (RZ F-,


a 
ay 






memset 
i-—e, 





© me # (5 Ag +5 &) = 












= Aa ¥ "3 aR? 7g (4, Xp DA a A 9 (% “5a * 2S, 2px) | 
ng (G2 pat SAMA) = ng A, g(a Pe - Bay 1 
(69) Az 5 = 9 Ais > 714 2, Agog (5 4,5 +s t-) 






. 299. (5 fr%s5 +s, &) = EBS 
(71) Q = 2f = 9


72 





Fe 
4 

Let $\eta_L$ and $e_L$ be the elasticities of long time demand and
supply respectively.


(72) Nn. ; Res Ca . (TaBS "Re RS, .\s Qc 
A = 


#25. /\ncese¢ 7g Op 
ZQ Oy =
Thus a reaction of price of one year on the supply situation
of the next does not tend to produce any difference between long
time and short time elasticity of demand. It does make a
difference, however, in long and short time elasticities of supply. In
the case of potatoes substitution of values already found gives
$e_L = +.52$ as the elasticity of the long time supply curve, insofar
as determined by the reaction of the price of one year on the
supply of the next. If it were legitimate to assume that there is
no elasticity of supply within a year ($e = 0$), the formula for $e_L$
reduces to S$ % 2 aoy 


aP a - top 
Tests of Significance 

In considering the reliability of path coefficients there are two 
questions which must be kept distinct. First is the adequacy of 
the qualitative scheme to which the path coefficients apply and 
second is the reliability of the coefficients, if one accepts the 
scheme as representing a valid point of view. The setting up of 
a qualitative scheme depends primarily on information outside of 
the numerical data and the judgment as to its validity must rest 
primarily on this outside information. One may determine from 
standard errors whether the observed correlations are compatible 
with the scheme and thus whether it is a possible one, but not 
whether it correctly represents the causal relation. 

Having accepted a certain scheme with which the data are
compatible, one would like to determine the reliability of the val- 
ues reached for the path coefficients. Obviously no single formula 
can be given, applicable to all cases. The basic formulae of the 
method are ones for writing series of simultaneous equations, 
which must be solved to obtain the unknown path coefficients and 
correlation coefficients. These equations are in general non-linear 
with respect to the unknown quantities, making it impossible to 
express the solution in a general formula in which substitution 
can be made in routine fashion. 

Certain principles can, however, be illustrated by the results
in simple cases. No attempt will be made here to deal with the
complications due to small numbers. It will be assumed that the 
errors of sampling are in general so small in comparison with the 
values of the coefficients that second degree terms in the errors 
may be ignored. It is recognized that a more thorough treatment 
of the matter is much to be desired. 

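The simplest set up (Fig. 33) is that in which one variable $V_0$
is represented as a function of another $V_1$, and of a residual
factor $V_u$. The equations are as follows:

(74)  $p_{01} = r_{01}$

(75)  $p_{01}^2 + p_{0u}^2 = 1$

From (74),

(76)  $\sigma_{p_{01}} = \sigma_{r_{01}} = \frac{1 - r_{01}^2}{\sqrt{N}}$

From (75)

(77)  $2 p_{01}\,\delta p_{01} + 2 p_{0u}\,\delta p_{0u} = 0$

(assuming as noted above that $\delta p_{01}$
and $\delta p_{0u}$ are small compared with $p_{01}$ and $p_{0u}$).

(78)  $\delta p_{0u} = -\frac{p_{01}}{p_{0u}}\,\delta p_{01}$

(79)  $\sigma_{p_{0u}}^2 = \frac{p_{01}^2}{p_{0u}^2}\,\sigma_{p_{01}}^2 = \frac{1}{N}\, r_{01}^2 (1 - r_{01}^2)$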
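
The one-factor case can be sketched numerically. The snippet assumes the usual large-sample standard error of a correlation coefficient, $(1 - r^2)/\sqrt{N}$, and propagates it to the residual path coefficient through $p_{01}^2 + p_{0u}^2 = 1$; the function names are illustrative and not from the original.

```python
import math

def se_correlation(r, n):
    """Large-sample standard error of r (assumed formula: (1 - r^2)/sqrt(N))."""
    return (1.0 - r * r) / math.sqrt(n)

def se_residual_path(r01, n):
    """First-order standard error of p_0u when p_01 = r_01 and
    p_01^2 + p_0u^2 = 1, so that delta p_0u = -(p_01/p_0u) delta p_01."""
    p01 = r01
    p0u = math.sqrt(1.0 - r01 * r01)
    return abs(p01 / p0u) * se_correlation(r01, n)

# Example: r_01 = .6 with N = 100 observations.
print(se_correlation(0.6, 100))    # standard error of p_01
print(se_residual_path(0.6, 100))  # standard error of p_0u
```

The second value equals $r_{01}\sqrt{1 - r_{01}^2}/\sqrt{N}$, which shows that the residual coefficient is determined most precisely when $r_{01}$ is near 0 or near 1.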


The standard error of the residual path coefficient in a system
in which one variable is represented as determined by a number of
others (Fig. 34) may be derived similarly

(80)  $\sigma_{p_{0u}}^2 = \frac{1}{N}\, R_{0(12 \ldots m)}^2 \left[1 - R_{0(12 \ldots m)}^2\right]$

Consider next the case in which variable $V_0$ is a function of two
uncorrelated variables $V_1$ and $V_2$, and of residual factor $V_u$.
Two different solutions are obtained for $\sigma_{p_{01}}$ depending on the
point of view. If it is accepted that $V_1$ and $V_2$ are wholly
independent, except for the accidents of sampling, we have

(81)  $p_{01} = r_{01}$

(82)  $\sigma_{p_{01}}^2 = \frac{1}{N}(1 - r_{01}^2)^2$

If, however, there are no grounds for treating $V_1$ and $V_2$ as
independent, except that $r_{12}$ was insignificantly small in the data
at hand, the proper set up is one in which a correlation between
$V_1$ and $V_2$ is indicated as in Fig. 35.
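(83)  $r_{01} = p_{01} + p_{02} r_{12}$

(84)  $r_{02} = p_{02} + p_{01} r_{12}$

giving

(85)  $p_{01} = \frac{r_{01} - r_{02} r_{12}}{1 - r_{12}^2}$

Treating sampling errors as differentials

(86)  $\delta p_{01} = \frac{1}{1 - r_{12}^2}\left(\delta r_{01} - r_{12}\,\delta r_{02} - r_{02}\,\delta r_{12}\right) + \frac{2 r_{12}(r_{01} - r_{02} r_{12})}{(1 - r_{12}^2)^2}\,\delta r_{12}$

In the present case, $r_{12}$ (but not $\delta r_{12}$) is assumed to be
zero in the sample at hand. Thus

(87)  $\delta p_{01} = \delta r_{01} - r_{02}\,\delta r_{12}$

(88)  $\sigma_{p_{01}}^2 = \sigma_{r_{01}}^2 + r_{02}^2\,\sigma_{r_{12}}^2 - 2 r_{02}\,\mu_{r_{01} r_{12}}$

where $\mu_{r_{01} r_{12}}$ is the product moment of deviations of $r_{01}$ and $r_{12}$,

(89)  $\mu_{r_{01} r_{12}} = \frac{1}{N}\left[ r_{02}(1 - r_{01}^2 - r_{12}^2) - \tfrac{1}{2} r_{01} r_{12}(1 - r_{01}^2 - r_{12}^2 - r_{02}^2) \right]$

by the formula of Pearson and Filon. Again treating $r_{12}$ as
negligibly small,

(90)  $\mu_{r_{01} r_{12}} = \frac{1}{N}\, r_{02}(1 - r_{01}^2)$

(91)  $\sigma_{p_{01}}^2 = \frac{1}{N}\left[ (1 - r_{01}^2)^2 + r_{02}^2 - 2 r_{02}^2 (1 - r_{01}^2) \right]$

(92)  $\sigma_{p_{01}}^2 = \frac{1}{N}\left[ (1 - r_{01}^2)^2 - r_{02}^2 (1 - 2 r_{01}^2) \right]$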
This is smaller than the value of $\sigma_{p_{01}}^2$ obtained on the
assumption of independence of $V_1$ and $V_2$ if $r_{01}^2$ is less than ½, but
larger for larger values of $r_{01}^2$.
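
The crossover at $r_{01}^2 = \tfrac{1}{2}$ can be seen numerically. The sketch assumes the two variances read $(1 - r_{01}^2)^2/N$ under accepted independence and $[(1 - r_{01}^2)^2 - r_{02}^2(1 - 2r_{01}^2)]/N$ when a correlation between $V_1$ and $V_2$ is indicated but happens to be zero in the sample; the function names are illustrative.

```python
def var_p01_independent(r01, n):
    """Variance of p_01 when V_1 and V_2 are accepted as independent."""
    return (1.0 - r01**2) ** 2 / n

def var_p01_indicated_correlation(r01, r02, n):
    """Variance of p_01 when r_12 is left free but is zero in the sample."""
    return ((1.0 - r01**2) ** 2 - r02**2 * (1.0 - 2.0 * r01**2)) / n

n = 100
for r01 in (0.5, 0.5 ** 0.5, 0.8):
    a = var_p01_independent(r01, n)
    b = var_p01_indicated_correlation(r01, 0.4, n)
    print(f"r01^2 = {r01**2:.2f}: independent {a:.6f}, indicated correlation {b:.6f}")
```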

If the correlation between the two known factors $V_1$ and $V_2$
of figure 35 is not negligible, the squaring of the full formula
for $\delta p_{01}$ and division by $N$ leads after some reduction to the
formula
(93) op = = = “ RP (tar, -22%,,,,)] . 

A somewhat rough estimate of the standard errors in the
analysis of birth weight of guinea pigs, page 177, can be made
by this formula. The correlation between birth weight and size
of litter was however based on larger numbers (3353) than the
correlation involving gestation period (1317). Adopting the
smaller numbers we find


= = 6/ =. 02e 


8 


BL 


e.  & ee £.O04a2, Fic. 36 


While these estimates of the standard errors do not take
cognizance of the approximation involved in substitution of gestation
period for observed interval between litters (estimated correlation .75)
they are sufficient to indicate that the calculated path coefficient
can be relied upon as accurate to a first order, assuming the
correctness of the set up.

The standard error of a path coefficient has not been worked
out for systems in which one variable is represented as affected
by more than two known variables. The standard error of the
closely allied concrete regression coefficient is however well
known and can be used in testing significance.

(94)  $\sigma_{c_{01}}^2 = \frac{\sigma_0^2}{\sigma_1^2}\,\frac{1 - R_{0(12 \ldots m)}^2}{N\left[1 - R_{1(2 \ldots m)}^2\right]}$

Since $p_{01} = c_{01}\frac{\sigma_1}{\sigma_0}$, the variance of the path coefficient
can be written

(95)  $\sigma_{p_{01}}^2 = \frac{1 - R_{0(12 \ldots m)}^2}{N\left[1 - R_{1(2 \ldots m)}^2\right]}$  if $\frac{\sigma_1}{\sigma_0}$ can be treated as constant.

This probably gives fairly good approximation in
any case and is so used by Brandt (1928). In the case of guinea
pig weight discussed above, the correct formula gives a result a
little smaller than this approximation.

It will be noted that the standard errors may take very high
values if the independent variable under consideration ($V_1$)
approaches complete determination by the others in the system, i.e.
if $1 - R_{1(2 \ldots m)}^2$ approaches 0. In general, coefficients for paths
leading from variables closely correlated with each other are
subject to large standard errors. In making up a system, whether
for prediction purposes or interpretation, the aim should be to
select factors closely correlated with the dependent variable but
as nearly independent of each other as practicable.
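
The effect of correlated factors on the standard error can be illustrated with a short sketch. It takes the approximation $\sigma_{p_{01}}^2 \approx (1 - R_{0(12 \ldots m)}^2) / \{N[1 - R_{1(2 \ldots m)}^2]\}$ as the working formula, which is an assumption about the reading of equation (95); the function and argument names are illustrative.

```python
import math

def var_path_approx(r2_dependent, r2_factor, n):
    """Approximate variance of p_01.

    r2_dependent: squared multiple correlation of V_0 with all factors.
    r2_factor:    squared multiple correlation of V_1 with the other factors.
    """
    return (1.0 - r2_dependent) / (n * (1.0 - r2_factor))

n = 100
for r2_factor in (0.0, 0.5, 0.9, 0.99):
    se = math.sqrt(var_path_approx(0.5, r2_factor, n))
    print(f"R^2 of V_1 on other factors = {r2_factor:.2f}: standard error ~ {se:.3f}")
```

The standard error grows without limit as the factor under consideration approaches complete determination by the others.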

If the dependent variable is completely determined by the
specified factors ($R_{0(12 \ldots m)}^2 = 1$) the standard error of the
concrete partial regression coefficient becomes zero. This is not the
case with that of the path coefficient. Thus in the two factor
case discussed above


2 BP O-2) 0-25 N28, 
(96) = {> 


or 


2 
1 22502) ~ / 
W(1-2%,) F Mocs) 


More generally, if $c_{01}$ can be treated as constant (as it can
if $R_{0(12 \ldots m)}^2 = 1$),


- bee all G (I-2e, "de 
(97) 0. = << O(a) = Gj oe? or C aol 


Nv N 


which is in agreement with the preceding result. 

Another simple set up, which is of interest, is that in which
three variables are arranged in chain sequence (Fig. 37). Here
again the point of view makes a difference. If the above relation
is merely an empirical one, the situation is merely a special case
of that just discussed (the case in which $r_{02} = r_{01} r_{12}$). By
substitution we find


(98) P = -,,) 
(99) s 4 apa 


(100) 2 [ oe 


' 
a} 

If, however, $V_1$ is represented as the sole intermediary between
$V_2$ and $V_0$ on theoretical grounds, the result is different. Two
different determinations can be made of $p_{01}$ and of $\sigma_{p_{01}}$, the
reason being that more equations can be written than there are
unknown path coefficients. From $p_{01} = r_{01}$,


2 


(101) -t(i-2 
= ie % 
a —Jk., VG es. } 


Fg 


am 


(103) £G-A~)y 


(102) ~ 


z 2 
(04) ew he etn UU“ * 2) 
fn v oy 2 


Similarly two determinations can be made of $\sigma_{p_{12}}$.


- 2 
(105) Sp = - [ i- “a . From 


o12 


(106) c= + [a-Aany- G-%, )O-2,)]. 


012 
With standard deviations calculated from two independent 
sets of data in each case, a combination estimate, smaller than 
either can be obtained from the formula 


(107) 


/ 
oa 
This illustrates the important principle that where there is a
superfluity of equations for determining the path coefficients, the
standard errors of these are correspondingly reduced. In the 
analysis of corn and hog correlations 42 path coefficients were 
found with which 510 correlations (and 4 cases of complete de- 
termination) were in agreement to the extent expected from their 
standard errors. Calculation of the standard errors of the path 
coefficients in this system seems out of the question, but it may 
safely be assumed that values of the order a , which might be 
based on 42 equations are to be reduced by considerable amounts 
by the superfluity of data available. 
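
The reduction obtained by combining two independent determinations can be sketched as follows. The rule used here, $1/\sigma^2 = 1/\sigma_1^2 + 1/\sigma_2^2$, is the usual inverse-variance combination and is an assumption: the combination formula itself is not legible in this copy, and the function name is illustrative.

```python
def combined_standard_error(se1, se2):
    """Standard error of an inverse-variance weighted combination of two
    independent estimates (assumed rule: 1/s^2 = 1/s1^2 + 1/s2^2)."""
    return (1.0 / se1**2 + 1.0 / se2**2) ** -0.5

# Two hypothetical independent standard errors for the same coefficient.
print(combined_standard_error(0.10, 0.10))
print(combined_standard_error(0.05, 0.12))
```

Under this rule the combined value is always smaller than either of the separate standard errors, in line with the principle stated above.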

There are some interesting contrasts in the standard errors
given above. If $r_{12}$ is large, $\sigma_{p_{01}}$ may be large in the empirical
system. But if the theory that $V_1$ is the only intermediary rests
on adequate grounds, independent of the observed correlations,
$\sigma_{p_{01}}$ may be small with large $r_{12}$.

We will conclude with consideration of a set up like that used
for the relation of supply and demand to price and quantity
(Fig. 38). It will be assumed first that the number of cases is
large (a condition contrary to that found in the examples given).
Differentiation of the 5 basic equations gives 5 equations
expressing the relations between small deviations of the path
coefficients and correlations.

(108)  $p_1^2 + p_2^2 = 1$              (113)  $2 p_1\,\delta p_1 + 2 p_2\,\delta p_2 = 0$
(109)  $q_1^2 + q_2^2 = 1$              (114)  $2 q_1\,\delta q_1 + 2 q_2\,\delta q_2 = 0$
(110)  $p_1 q_1 + p_2 q_2 = r_{PQ}$     (115)  $p_1\,\delta q_1 + q_1\,\delta p_1 + p_2\,\delta q_2 + q_2\,\delta p_2 = \delta r_{PQ}$
(111)  $p_2 s = r_{BP}$                 (116)  $p_2\,\delta s + s\,\delta p_2 = \delta r_{BP}$
(112)  $q_2 s = r_{BQ}$                 (117)  $q_2\,\delta s + s\,\delta q_2 = \delta r_{BQ}$
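Thus

(118)  $\delta p_2 = -\frac{p_1}{p_2}\,\delta p_1$

(119)  $\delta q_2 = -\frac{q_1}{q_2}\,\delta q_1$

(120)  $\delta q_1\left(p_1 - \frac{p_2 q_1}{q_2}\right) + \delta p_1\left(q_1 - \frac{p_1 q_2}{p_2}\right) = \delta r_{PQ}$

(121)  $s\left(\frac{p_2 q_1}{q_2}\,\delta q_1 - \frac{p_1 q_2}{p_2}\,\delta p_1\right) = q_2\,\delta r_{BP} - p_2\,\delta r_{BQ}$
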
Solution of (120) and (121) as simultaneous equations gives
expressions for $\delta p_1$ and $\delta q_1$ in terms of $\delta r_{PQ}$, $\delta r_{BP}$ and $\delta r_{BQ}$,
from which their squared standard errors can be found by taking
the average squares. Letting $A$, $B$ and $C$ be the coefficients,

(122)  $\delta q_1 = A\,\delta r_{PQ} + B\,\delta r_{BP} + C\,\delta r_{BQ}$

(123)  $\sigma_{q_1}^2 = A^2 \sigma_{r_{PQ}}^2 + B^2 \sigma_{r_{BP}}^2 + C^2 \sigma_{r_{BQ}}^2 + 2AB\,\mu_{r_{PQ} r_{BP}} + 2AC\,\mu_{r_{PQ} r_{BQ}} + 2BC\,\mu_{r_{BP} r_{BQ}}$


The product moments of the deviations of the correlation co- 
efficients can be found by Pearson & Filon’s formula cited on 
page 206. 

The standard errors of $p_2$ and $q_2$ can be found at once
with the help of equations (118) and (119) while that of $s$ can
be found from (116) or (117) after expressing $\delta p_2$ (or $\delta q_2$) in
terms of deviations of the known correlation coefficients.

The significance of the coefficients of elasticity is most easily
investigated by taking these on scales in which the standard errors
of the percentage deviation in price and quantity are taken as
unity, i.e. by finding the standard error of $\frac{q_1}{p_1}$ and of $\frac{q_2}{p_2}$ instead
of $e = \frac{q_1 \sigma_Q}{p_1 \sigma_P}$ and $\eta = \frac{q_2 \sigma_Q}{p_2 \sigma_P}$ respectively. These standard
errors can be found from the formula for the standard error of
a ratio.

(124)  $\sigma_{q_1/p_1}^2 = \frac{q_1^2}{p_1^2}\left[ \frac{\sigma_{q_1}^2}{q_1^2} + \frac{\sigma_{p_1}^2}{p_1^2} - \frac{2\,\mu_{p_1 q_1}}{p_1 q_1} \right]$

The product moments of the path coefficients can be obtained
by squaring equation (115) after expressing $\delta p_2$ and $\delta q_2$ in terms
of $\delta p_1$ and $\delta q_1$, equation (120), or the converse.

The numbers of cases in the actual examples were not large 
enough to make the method a satisfactory one. The calculations 
have been carried through, however, with the results given below. 


              Summer Pork      Winter Pork      Potatoes
              26 Years         44 Years         19 Years

$r_{PQ}$      -.63 ± .12       -.68 ± .08       -.85
$r_{BP}$      -.47 ± .16       -.65 ± .09
$r_{BQ}$      +.64 ± .12       +.83 ± .05

[remaining entries illegible]

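
The standard errors attached to the primary correlations appear to follow the usual large-sample formula $(1 - r^2)/\sqrt{N}$. The following is a quick check under that assumption, using the correlations and numbers of years as read from the text; the function name is illustrative.

```python
import math

def se_correlation(r, n):
    """Large-sample standard error of a correlation (assumed: (1 - r^2)/sqrt(N))."""
    return (1.0 - r * r) / math.sqrt(n)

cases = [
    ("summer pork, r_PQ", -0.63, 26),
    ("summer pork, r_BQ", +0.64, 26),
    ("winter pork, r_PQ", -0.68, 44),
    ("winter pork, r_BQ", +0.83, 44),
]
for label, r, n in cases:
    print(f"{label}: {r:+.2f} +/- {se_correlation(r, n):.2f}")
```

With only 19 to 44 years in each series, these first-order values are themselves rough, which is the reservation the text goes on to make.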
The most nearly satisfactory case is that of winter pork based
on rather large primary correlations obtained from 44 years’
experience, but even here, the standard error of $q_1$ is nearly as
large as $q_1$ itself. In the other cases, the standard error of $q_1$ is
larger than $q_1$. The term $(\delta q_1)^2$ omitted in equation (114) is of the
order of the term $2 q_1\,\delta q_1$ or larger, making this equation invalid.
The other equations are not affected, at least to anything like as
great an extent. An approximate solution can be obtained even
though equation (114) is omitted, from the consideration that in
this case in which $q_1$ is small, $\delta q_2$ must be very small and may
be ignored. Thus


(125) On 


BQ 
, A. 
(126) [in - = J254] 


(127) AL F2 - 3 9F] 
(128) -- Bon. 


The results are substantially the same as those obtained above,
since the values assigned $\delta q_2$ were very small, even if not reliable.
It may be safely concluded that winter pork at the large markets 
has very little elasticity of supply but a moderate elasticity of 
demand. The results for summer pork and for potatoes are in 
harmony with similar interpretations but are based on such inade- 
quate numbers as to have little significance in themselves. 


REFERENCES 

Brandt, A. E., 1928—Calculation and use of the standard deviation of par- 
tial regression coefficients. Iowa St. College Jour. Sci. 2: 235-242. 

Burks, B. S., 1928—The relative influence of nature and nurture upon men- 
tal development; a comparative study of foster-parent—foster-child 
resemblance and true parent—true child resemblance. 27th Yearbook 
of Nat. Soc. for Study of Education, 1928, Part I:219-316. 

Calder, A., 1927—The role of inbreeding in the development of the Clydes- 
dale breed of horse. Proc. Roy. Soc. Edinb. 47: 118-140. 

Fisher, R. A., 1925—Statistical methods for Research Workers. 239 pp. 
Oliver & Boyd. Edinburgh. 

Jennings, H. S., 1916—The numerical results of diverse systems of breed- 
ing. Genetics 1: 53-89. 

Kelley, F. L., 1927—Statistical Method. 390 pp. The Macmillan Co. New 
York. 

Krichewsky, S., 1927—Interpretation of Correlation Coefficients. Ministry 
of Public Works. Egypt. Physical Dept. Paper No. 22, Cairo.

Lush, J. L., 1930—The number of daughters necessary to prove a sire, 
Jour. Dairy Sci. 13: 209-220. 

1932—The amount and kind of inbreeding which has occurred in the 
development of breeds of livestock. Proc. 6th Internat. Congress of 
Genetics 2: 123-126. 

McPhee, H. C., and S. Wright, 1925—Mendelian analysis of the pure breeds 
of live stock. III. The Shorthorns. Jour. Hered. 16: 205-215. 

1926—Mendelian analysis of the pure breeds of live stock. IV. The 
British Dairy Shorthorns. Jour. Hered. 17: 397-401. 

Minot, C. S., 1891—Senescence and rejuvenation. Jour. Physiol. 12: 97-153. 

Niles, H. E., 1922—Correlation, Causation and Wright’s theory of path 
coefficients. Genetics 7: 258-273. 

1923—The method of path coefficients, an answer to Wright. Genetics 
8: 256-260. 

Smith, A. D. B., 1926—Inbreeding in cattle and horses. Eugen. Rev. 14: 
189-204. 

Wright, P. G., 1928—The tariff on animal and vegetable oils. 347 pp. The
Macmillan Co., New York.

Wright, S., 1918—On the nature of size factors. Genetics 3: 367-374.

1920—The relative importance of heredity and environment in
determining the piebald pattern of guinea pigs. Proc. Nat. Acad. Sci. 6:
320-332.

1921a—Correlation and Causation. Jour. Ag. Res. 20: 557-585.

1921b—Systems of mating. Genetics 6: 111-178.

1922—Coefficients of inbreeding and relationship. Am. Nat. 56: 330-
338.

1923a—The theory of path coefficients—a reply to Niles’ criticism.
Genetics 8: 239-255.

1923b—Mendelian analysis of the pure breeds of livestock, I. The
measurement of inbreeding and relationship. Jour. Hered. 14: 339-348.
II. The Duchess family of Shorthorns as bred by Thomas Bates. Jour.
Hered. 14: 405-422.

1925a—Corn and hog correlations. Bull. No. 1300, 60 pp. U. S. Dept.
of Agric.

1926a—A frequency curve adapted to variation in percentage
occurrence. Jour. Amer. Stat. Assoc. 21: 162-178.

1926b—Effects of age of parents on characteristics of the guinea pig.
Amer. Nat. 60: 552-559.

1927—The effects in combination of the major color-factors of the
guinea pig. Genetics 12: 530-569.

1931a—Statistical methods in biology. Jour. Amer. Stat. Ass.
Supplement. Papers and Proceedings of the 92nd annual meeting. 26: 155-
163.

1931b—Evolution in mendelian populations. Genetics 16: 97-159.

1932a—On the evaluation of dairy sires. Proc. Amer. Soc. Animal
Prod. 1932: 71-78.

1932b—General, group and special size factors. Genetics 17: 603-619.

Wright, S., and H. C. McPhee, 1925—An approximate method of calculat- 
ing coefficients of inbreeding and relationship from livestock pedigrees. 
Jour. Ag. Res. 31: 377-383. 

Wright, Sewall, 1933a—Inbreeding and homozygosis. Proc. Nat. Acad. Sci. 
19: 411-420. 

——1933b—Inbreeding and recombination. Proc. Nat. Acad. Sci. 19:420- 
433. 













MATHEMATICAL FOUNDATION FOR A METHOD 
OF STATISTICAL ANALYSIS OF HOUSEHOLD 
BUDGETS 


By John W. Boldyreff
Harvard University 


The object of this paper is to offer a satisfactory method of 
statistical analysis of household budgets in accordance with the 
general principles of mathematical logic. I have, therefore, taken 
these words of Fourier: “Mathematics has no symbols for confused
ideas”¹ as my guiding light, and set out to effect a simple
and comprehensive analysis of the general type of statistical data
which is included under the heading “household budgets,” i.e.
monetary incomes and expenditures of these incomes. 

I have tried to lay the greatest stress, accordingly, on the 
clarity and terseness of the exposition rather than inclusiveness, 
attempting to diminish to the utmost the number of undefined 
ideas and the undemonstrated propositions. I make no special claim 
to originality and base my method upon the works of numerous 
previous investigators, summarizing analytically old principles and 
ideas on the bases of mutual consistency and reducibility to more 
fundamental principles. This paper is specially framed to relieve 
the feeling of intellectual discomfort which of late has been trou- 
blesome to conscientious investigators in our field, so overcrowded 
with revelations of numerous parts, rather than with indications 
of the mode of combination of the major components within the 
whole. I address here the properly instructed mind and so dis- 
pense at times with the elaboration of some statements. 

In this summary, therefore, we shall be concerned with laying 
down a rigid method for analysing the budgetary data, defining 
their scope formally to include only the monetary incomes and 


1 Quoted from J. A. Schumpeter, Die Wirtschaftstheorie der Gegen- 
wart, Wien, I, 11, 1927. 
the relative amounts of these incomes spent in a defined manner.
This, naturally, excludes all reference to economic theory (e. g. 
utility, demand curves, etc.) from our discussion; I do so not 
because of a desire to depreciate the importance of that kind of 
belief, but because I do not wish to consider it here. 

Obviously, “it is never a mathematical proposition which we 
need, but we use mathematical propositions only in order to infer 
from propositions which do not belong to mathematics to others 
which equally do not belong to mathematics.”² Moreover, it is
also true that nothing can be purely logical or mathematical (unless
we follow Hilbert and define mathematics as a game with
meaningless marks on paper); all propositions involve some psychological
terms such as defining, meaning, asserting or naming.
The method and scope of a mathematical analysis is in a like 
manner dependent on the purpose for which it is to be undertaken. 

The purposes in the study of budgetary data assume varying
emphasis depending on the point of view of approach, that of
economics, home economics, social welfare, and sociology.³ All of
these approaches are concerned with the relation between the sizes 
of incomes and the relative amounts spent for certain goods and 
services. 


Generally, the classification of expenditures of an income is 
made as to the amounts (or proportions) spent for food, clothing, 
rent, light, education, health, recreation, savings, and amusement. 
Some investigators limit their classifications to five items: food, 
clothing, rent, fuel and light, and sundries (everything not in- 
cluded under the first four). Others prefer to subdivide the 
classification further and break up each of the above nine types 
of expenditure into what they deem to be its component parts, 
and proceed to study these new relationships and to generalize 
from them. On my part, I judge the latter performances extremely dangerous.


2 Wittgenstein, Tractatus Logico-Philosophicus, 6.211.
3 C. C. Zimmerman, Am. J. Soc., vol. XXXIII, 6, 1928.
It seems to me that the analysis of the major
components of income’s expenditure in their relationships to the 
size of income and to each other should be developed and per- 
fected beforehand, and then only gradually extended to apply 
to the minor items. Moreover, the splitting up of a few variables 
(the types of expenditures) into many introduces other difficulties 
—aside from the fact that a study of simple relationships is apt 
to be more clarifying—the introduction of a component part of 
the whole variable as a new variable, immediately raises the ques- 
tion why this component is isolated and not the other. None of 
the arguments that can be generally cited (and usually no argu- 
ments are cited) are really decisive, and the position is extremely 
unsatisfactory to anyone with real curiosity about the fundamental 
relationships. Unless we wish the analysis of the budgetary data
to remain self-contradictory and meaningless, we must adopt a 
limiting method, and study not more than two variables at a time. 
Then, and only then, can we hope to establish or discover any 
“laws,” or functional relationships. 

In my experiments to develop a satisfactory method of anal- 
ysis I would begin generally with five classes of expenditures: 
food, clothing, rent, fuel and light, and sundries. Later, I have 
come to the conclusion that some of these tend to have a sort of 
complementary relationship between them. Thus, “fuel and light” 
are often higher or lower with a higher or lower “rent,” and in 
some cases a part of “rent” covers “fuel and light,” in other cases 
the discomfort and monetary cost of “fuel and light” lowers the 
“rent” expenditure. Likewise, some complementary relationship 
is observed between “fuel and light” and “clothing” (especially 
in submarginal households) and between “clothing” and “rent” 
(e. g. social demand of the stylish residential district). These are 
merely a few examples which led me to question the validity of 
initial isolation of these three items (clothing, fuel and light, 
and rent) from each other. Accordingly, I suggest limiting our
investigation to the study of possible relationships between: (1)
the size of the income and (a) the amount spent for food, (b)
the amount spent for sundries; (2) the amount spent for food
and the amount spent for sundries, assuming temporarily, for
convenience of analysis, all other itemized expenditures under
rent, clothing, and fuel and light, to be not subject to individual
isolation.

As to the unit in household budget, the variety of units 
employed bewilders at first a mathematical student. Of these, 
the old scale of two children for one adult, the various other 
“adult equivalents” (e.g. Engel’s quet scale of 3.0 for woman 
of 20 and 3.5 for man of 25 years; Atwater’s scale of 10, 8, 7, 5, 
2.5; then the scales of Voit, U.S.D. of L., H. C. Sherman and 
L. H. Gillett, G. Lusk, L. Emmett Holt, and others—each scale 
giving “adult equivalents” for children, male and female), all 
clearly show inability of investigators to agree on a scale to 
determine the size of a family in standard units. It seems to me 
that the inventors of such scales forget somehow that “taking 
an arbitrary individual in the living nature—a man, an animal, 
a plant—it will generally be found impossible to find out another 
individual in all respects identical to the first one chosen.”⁴ The
standard scale in budgetary studies is less valid than usual statis- 
tical abstractions, for such factors as geographic space (climate, 
nutritive ratio, energy value, cost), social space (stratification and 
differentiation), economic space (size of incomes), occupational 
space (caloric requirement, etc.), time factor (daily, weekly, 
monthly, seasonal, and longer fluctuations), as well as age (for 
there is a great latitude in “adult” ages and a corresponding 
variability in “requirements”) and sex differences, are admittedly 
affecting each budgetary individual in a variety of unknown ways. 
In view of the complexity of the problem and the enormousness 
of human population, any “adult equivalent” scale will appear 


4 C. V. L. Charlier, Acta Universitatis Lundensis, 1905-6, XVI, 5, p. 3.
to be based on samples obtained in gross violation of the sampling
theory, for it is very doubtful that a sufficiently large and representative
sample can be secured and it is very hard to see how
it can escape being greatly biased. Besides, most of these scales 
are based on energy requirement only, and refer to “food” but 
not at all to other types of expenditure; therefore, they would be 
of little general significance even if they were valid in their spe- 
cific aspect. Personally, I must reject all such scales as meaning- 
less and incline to hesitate between adopting a “normal family” 
(on basis of a standard number of members, irrespective of their 
characteristics) and a “household” (irrespective of number of 
members and of their characteristics), the presumption being that 
in a sufficient random sample the differences either way will tend 
to cancel out. This may not seem to be a more accurate method 
than others, but, in all probability, it is just as accurate, and its 
virtue lies, moreover, in the fact that its limitations are all on the 
surface instead of being hidden away behind a misleading label. 
The data of the last Census seem to favor this attitude.⁵

The purpose of budgetary analysis is to discover, allegedly, 
certain functional relationships, if any, between the varying in- 
come and the relative amounts of each type of expenditure. To 
discover such relationships and to determine them explicitly one 
must recognize that all laws logically function within limits. One
need not go as far as Hilbert and insist that anything involving
an infinity of any kind must be meaningless—in pure mathematics 
this may be a useful abstraction—but it should be obvious that 
in all organic laws anything infinite appears a stupid fiction which 
cannot be argued for except by proceeding to a limit. The be- 
havior of the budgetary items is clearly a biotic phenomenon 
which fact some of the investigators in our field tend to overlook
consistently. If there are any functional relationships in the budgetary


5 L. E. Truesdell, New Family Statistics for 1930, J. Am. Statist. Assn.,
March 1933 (Supplement), pp. 154-8.
data these will be found only within definite limits of
minimum and maximum, and any contradicting evidence to such 
laws if found below or above these limits cannot be interpreted 
as disproving such laws. 

We shall make our points clearer by illustrating the above 
exposition by the so-called Engel’s Law (I am referring to the 
second part of it), incidentally commenting briefly on its validity 
and demonstrating the details of our method. 

It will not be amiss to formulate in a few words the part 
of Engel’s Law (1895) we shall be concerned with in our dis- 
cussion. Comparing the incomes of laboring families, middle class 
families, and well-to-do families, Engel conjectured that: 

(1) the greater the income, the smaller the percentage 
of outlay for subsistence (food), 

(2) percentage of outlay for clothing is approximately 
the same, whatever the income, 

(3) percentage of outlay for rent, and for fuel and 
light, is approximately the same, whatever the income, 

(4) as income increases in amount, the percentage of 
outlay for sundries becomes greater. 


Most of the investigators incline to accept the first and the 
last of Engel’s propositions, both from the static and dynamic 
viewpoints. As for myself, I like to consider this law with refer- 
ence to the following questions: 

(1) as incomes increase does the percentage of outlay
for food decline and the percentage of outlay for sundries
increase?

(2) is this a static law; i.e. in a given place, at a
given time, will there be a higher percentage of outlay
for sundries and lower percentage of outlay for food
with larger incomes, and vice versa for smaller incomes?

(3) does this hold in the dynamic aspect: as incomes
increase (in time), do the percentages of outlay for food
decline and those for sundries rise, for short and long
time?

(4) is this law reversible, i.e. if incomes decrease do
the percentages of outlay for food rise and those for
sundries decline, statically and dynamically?

(5) can the percentages of outlay for clothing, rent,
and fuel and light be treated as constant, statically and
dynamically?

(6) can this law be interpreted to mean that when the
percentage of outlay for food declines the percentage of
outlay for sundries rises, and vice versa, statically and
dynamically?

(7) if this law is valid, what is its significance for
forecasting?


Let us consider first the problem of limits from a purely 
abstract viewpoint. We assume for the sake of argument this 
law to be valid and set up a hypothetical series of incomes with 
the respective percentages and amounts of outlays for food and 
for sundries. The following example shows clearly that a limit
is eventually reached when the law becomes automatically inoperative.

TABLE I.

Income in $      % for Food    % for Sundries    $ for Sundries

Under    900                                            —
  ”    1,000                                           100
  ”    2,000                                           300
  ”    3,000                                           600
  ”    4,000                                         1,000
  ”    5,000                                         1,500
  ”    6,000                                         2,100
  ”    7,000                                         2,800
  ”    8,000                                         3,600
  ”    9,000                                         4,500
  ”   10,000                                         5,500
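
The percentage columns of Table I are not legible in this copy, but the shares implied by the two surviving columns can be recomputed. The sketch below assumes, purely for illustration, that each income class is represented by its upper bound; the extension of the series beyond $10,000 is likewise an assumption that continues the pattern of the table and does not reproduce figures from it.

```python
# Upper bounds of the income classes and dollar outlays for sundries,
# as they survive in Table I (the "Under 900" row carries no sundries figure).
incomes = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
sundries = [100, 300, 600, 1000, 1500, 2100, 2800, 3600, 4500, 5500]

# Share of income going to sundries, in percent.
# Assumption: each class is represented by its upper bound.
shares = [100 * s / y for s, y in zip(sundries, incomes)]


def first_income_at_full_share(step=1000, gain=5.0, start_income=1000, start_share=10.0):
    """Continue the table's pattern (+5 points per $1,000) until the share reaches 100%."""
    income, share = start_income, start_share
    while share < 100.0:
        income += step
        share += gain
    return income


for y, p in zip(incomes, shares):
    print(f"{y:>6}  {p:5.1f}%")
print("share would reach 100% at income", first_income_at_full_share())
```

Under these assumptions the share for sundries rises by five points with each step of income, so that the rule cannot continue indefinitely without absorbing the whole income.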
Aside from demonstrating the inevitableness of limits, this
illustration shows also that from purely common sense considera- 
tions constancy of interrelationship between variation of percent- 
ages for food and percentages for sundries is not feasible. That 
the absolute amount spent for food cannot decline with increase 
of income but should constantly keep on rising (though, perhaps, 
in small amounts), should be clear from common sense, even if we 
shall consider this amount as stationary after a certain sum is 
reached and credit the increase to sundries (cooks, maids, travel, 
eating out, etc.)—yet, even in such cases decline should be out of 
the question. 

Now we can give an illustration of the validity of the assumption
that the percentages of outlay for rent, clothing, and fuel and
light, for convenience of analysis and until proven to the contrary,
can be held constant. We have tried this with a variety of data
and generally found this to be true.


TABLE II.


COMPARISON OF THE PERCENTAGES OF THE TOTAL FAMILY EXPENDITURE
FOR THE DIFFERENT GROUPS OF LIVING COSTS⁶


Columns: Eden’s 73 English Budgets, 1796; Engel’s Belgian Data, 1853;
Le Play’s; U.S.D.L.; U.S.D.L.; Groton.

Item              Eden’s 73 English Budgets, 1796
Food              73
Rent              12
Clothing           7
Fuel and light
Sundries           3

Adding the “rent” and “clothing” items from Table II we 
obtain: 19.0, 22.5, 23.3, 30.4, 30.0, and 24.4; by adding to these 
their respective “fuel and light” items we obtain: 24.0, 28.1, 


6 Taken from Noble, Cornell University Agricultural Experiment Station
Bulletin, No. 431, Sept., 1924.
27.6, 36.3, 35.3, and 31.2. It seems justifiable to assume these
items in their summation to be a constant factor in time analysis. 
That they are constant for static analysis will be shown later. 
But it may be mentioned in passing that taking the data from 
Noble’s Table 19 (average percentages of expenditure of items 
of cost of living of 518 families in New York City, by income 
groups)⁷ and adding up our “constant factor” we get 35.9 for
the lowest income group and 36.4 for the highest. 
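
The two series of sums quoted above permit a simple check on the arithmetic. The following sketch recovers the fuel-and-light percentages implied by the sums and reports the mean and spread of the “constant factor” across the six studies; the implied figures are derived here from the quoted sums and are not read from Table II, and the variable names are introduced only for this illustration.

```python
# Sums quoted in the text for the six budget studies of Table II.
rent_plus_clothing = [19.0, 22.5, 23.3, 30.4, 30.0, 24.4]
constant_factor = [24.0, 28.1, 27.6, 36.3, 35.3, 31.2]  # rent + clothing + fuel and light

# Fuel-and-light percentages implied by the two series (derived, not quoted).
fuel_and_light = [round(t - r, 1) for t, r in zip(constant_factor, rent_plus_clothing)]

# Mean and spread of the "constant factor" over the six studies.
mean_factor = round(sum(constant_factor) / len(constant_factor), 2)
spread = round(max(constant_factor) - min(constant_factor), 1)

print("implied fuel and light:", fuel_and_light)
print("constant factor: mean", mean_factor, "spread", spread)
```

The reader may judge from the spread how far the summed items can be regarded as constant between studies made at different times and places.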

One illustration more. Below are the figures taken from 
the U.S.B.L., 18th Annual Report, 1904, p. 101.


TABLE III.


      300
  ”   400     16.09
  ”   500     16.50
  ”   600     17.20
  ”   700     19.39
  ”   800     21.63
  ”   900     23.02
  ” 1,000     23.21
  ” 1,100     23.69
  ” 1,200     26.13
1,200 and over





The “constant factor” taken at the lowest and highest incomes
is found to be 33.57 and 36.15 respectively. The examination of
the table from the point of view of finding a law, or functional
relationship, reveals such a phenomenon for the range of incomes
from $500 to $1,200, inclusive. We shall proceed to examine the
data included in these limits in accordance with our method.

We find the “constant factor” for $500 income to be 36.62 
and for $1,200 income, 36.19. 


7 Op. cit.
We assumed a straight line relationship and computed simple
coefficients of correlation between:
(1) incomes and percentages of outlay for food
(2) incomes and percentages of outlay for sundries
(3) percentages of outlay for food and those for sundries

We want to stress in this connection that to us the coefficient
of correlation means a measure of relationship which is already
empirically established, not a proof of such relationship. We
used L. P. Ayres’⁸ formula which we found convenient for computing
purposes. To avoid a fictitious correlation between incomes
and the percentages of outlay for food and for sundries,
we have divided the income column by a constant. To facilitate 
computation we have likewise divided the “percentages for food” 
and the “percentages for sundries” columns by constants. 
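
The effect of dividing a column by a constant can be examined directly. The sketch below uses the ordinary product-moment formula (not Ayres’ computing form, which is not reproduced in this paper) and hypothetical figures chosen only for illustration; it indicates that the coefficient is unchanged when either column is divided by a positive constant, so that such divisions lighten the arithmetic without altering the result.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical figures for illustration only; they are not taken from Table III.
incomes = [500, 600, 700, 800, 900, 1000, 1100, 1200]
pct_food = [46.9, 46.2, 45.0, 43.5, 42.0, 41.4, 40.6, 39.9]

r_raw = pearson_r(incomes, pct_food)
# Divide each column by a constant, as described in the text.
r_scaled = pearson_r([y / 100 for y in incomes], [p / 10 for p in pct_food])

print(round(r_raw, 4), round(r_scaled, 4))
```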

In making a summary comment on Engel’s law, I would 
like to stress the following points from a purely methodological 
viewpoint. There seems to be definite evidence that in a given 
place, at a given time, the law holds consistently within certain 
limits. For very low income groups some other law may hold,
or no law at all, and as to how extremely large incomes are spent
we do not know. From the dynamic aspect, the law appears to
have been working from the time of the French revolution up to 
the beginning of the present depression (much evidence could be 
cited to support this fairly well known fact, e.g. works of Schmol- 
ler, Rogers, D’Avenel, U. S. B. L. S. Bulletins, etc.). However, 
the study of W. A. Berridge (The need for a new survey of fam- 
ily budgets and buying habits, N. Y. Times, May 10, 1931, and 
“The Annalist,” July 17, 1931) seems to indicate that from the 
secular standpoint this law is not immediately reversible, for with
the shrinking incomes we observe a definite decline in the outlays 


8 J. Educ. Research, 1, March-June, 1920.
for all items, including food, except for the outlay for sundries
which appears to be almost stationary. 

That percentages of outlay for clothing, rent, and fuel and 
light, can be added up and treated as a constant factor both stat- 
ically and dynamically with rising incomes we can be reasonably 
certain of; what will happen with decreasing incomes in time 
analysis we are not ready to say. However, it must be borne in 
mind that even with the rising incomes the relationship between 
the percentages of outlay for food and for sundries need not be 
perfect as one may be led to think from their high individual 
coefficients of correlation with income in the example given above. 

As to a practical application of the budgetary analysis to 
forecasting, I shall venture to say that in a socially planned society 
(if such society is workable), the study of itemized expenditures 
may prove invaluable. In other societies it may be used to forecast 
some sort of consumption indices—if these will be successfully 
computed they will undoubtedly help to flatten the curve of 
business cycles to an appreciable degree. As to how to develop 
these indices, I have no suggestion to make just now, except that 
it must be on the basis of an extension of a crude analysis similar to the
one offered here, and the application of probability technique, properly
based on psychological and historical findings. All I hope to
have made clear in this paper is that the subject is very difficult, 
and that an analysis offered here is sufficient as a first step. 

In conclusion, I must stress my indebtedness to Professors 
J.D. Black, J. A. Schumpeter, and C.C. Zimmerman for advice 
and suggestions. I am also grateful to Professor Zimmerman for 
the materials he let me examine. But above all I am indebted 
to Professor W. L. Crum from whom my point of view and
method of attack are wholly derived; anything of value that I
may have said in this paper is due to him. 




ON THE RELATIVE STABILITY OF THE MEDIAN
AND ARITHMETIC MEAN, WITH PARTICULAR
REFERENCE TO CERTAIN FREQUENCY DISTRIBUTIONS
WHICH CAN BE DISSECTED INTO
NORMAL DISTRIBUTIONS¹


By
HARRY S. POLLARD


I.


THE CHOICE OF AN AVERAGE 


In any statistical investigation in which an average is to be 
used as a summarizing figure for a frequency distribution the 
question arises, which average best describes the distribution. 
That this is still a debatable question among writers on economic 
statistics is shown by a perusal of the many papers dealing with 
the measurement of seasonal variations which have appeared in 
recent years.²

Each of the proposed methods of isolating seasonal variations 
involves an averaging, either of monthly items or of relatives of 
monthly items, but whether this averaging is best accomplished 
by use of the arithmetic mean, the median, or the mean of a 
middle group of items seems to be a moot point. Persons³ employs
the median of link relatives of monthly items, since by this device
the influence of large non-seasonal variations may be greatly
moderated. Hart,⁴ in justifying the use of the arithmetic mean,
has shown that the method of monthly means gives the actual


1 A résumé of a dissertation, bearing the same title, written under
the direction of Professor Mark H. Ingraham and submitted in partial
fulfillment of the requirements for the degree of Doctor of Philosophy in
the University of Wisconsin, 1933.

2 For a bibliography of literature on this subject see Mills, F. C.,
Statistical Methods, p. 343.

3 Persons, W. M., Correlation of Time Series, Jour. Amer. Statis. Assn.,
June, 1923, p. 717.

4 Hart, W. L., The Method of Monthly Means for Determination of a
Seasonal Variation, Jour. Amer. Statis. Assn., Sept., 1922, pp. 341-349.








228 RELATIVE STABILITY 





monthly values of the seasonal variation in case the seasonal 
variation is strictly periodic throughout the period of years under 
consideration and the long term variations are also periodic 
with integral numbers of years as their periods. The proof of 
this theorem is based on a property of Fourier series discussed 
by Bôcher.⁵

The point of view of this paper is that another factor of 
importance should influence the choice of an average, that this 
choice should be guided not alone by consideration of exceptional 
cases which may arise, nor by theory which assumes a periodicity 
seldom found in sequences of economic data, but also by a con- 
sideration of the stability of the averages. For if a given fre- 
quency distribution is regarded as a random sample drawn from 
a theoretical distribution which contains a very large number of 
items, the accuracy with which a particular average of the sample 
will typify the entire theoretical distribution is influenced by the 
frequency curve for that average. It is the purpose of this paper 
to compare the stability of the arithmetic means and medians of 
frequency distributions which may be dissected into two and 
three normal distributions, and to develop a general method of
comparing the relative stability of the mean and median which 
shall be applicable to any frequency distribution. 

The dissection of a frequency curve into two normal components
has been discussed by Karl Pearson,⁶ who has developed
methods for determining the values of the parameters of both symmetrical
and asymmetrical frequency functions. He has applied
these methods to distributions of cranial weights. Crum⁷ has used
Pearson’s method of dissecting a symmetrical distribution in his 
discussion of the relative stability of the median and mean of link 


5 Annals of Mathematics, Second Series, vol. 7, p. 135, Formula (63).

6 Pearson, K., Contributions to the Mathematical Theory of Evolution,
Philosophical Transactions, Series A, vol. 185, 1894, pp. 71-110.

7 Crum, W. L., The Use of the Median in Determining Seasonal
Variation, Jour. Amer. Statis. Assn., March, 1923, pp. 607-614.
relatives of monthly figures for the rate of interest on sixty to
ninety-day commercial paper for the years 1890-1917, and his results 
are discussed in section VI of this paper. Our interest in an asym- 
metrical distribution composed of two normal distributions arises 
from the fact that such a distribution affords a good fit both to
distributions which possess two distinct modes, and to skewed dis- 
tributions with one mode. The study of a distribution which may 
be dissected into three normal components is suggested by the occur- 
rence in economic data of tri-modal distributions. This paper will 
be concerned with only a particular class of three-component dis- 
tributions, those which are symmetrical. 

The hypothesis from which this investigation started was that 
a good criterion for measuring the stability of an average is its 
standard deviation. However, a difficulty which soon presented 
itself was the accurate determination of the standard deviation of 
the median. The classical formula for expressing the standard
deviation, $\sigma_M$, of the medians of samples of $s$ items each, drawn
from a frequency distribution whose equation is $y = f(x)$ and
which satisfies the condition

$$\int_{-\infty}^{0} f(x)\,dx = \frac{1}{2} = \int_{0}^{\infty} f(x)\,dx$$

is

$$\sigma_M = \frac{1}{2\sqrt{s}\,f(0)}.$$



The approximation to the value of the standard deviation of the
median given by this formula is discussed in section IV, where it
is shown that, although this approximation is close to the true
value of the standard deviation of the median when $s$ is large, it
may be a very poor approximation when $s$ is small, particularly
for certain types of frequency curves.
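
The behavior of the classical formula may be illustrated by a small sampling experiment. The sketch below is only an illustration under stated assumptions: the parent distribution is taken to be the unit normal curve (for which $f(0) = 1/\sqrt{2\pi}$), the sample sizes 5 and 101 and the number of trials are arbitrary choices, and the function names are introduced here and do not come from the paper.

```python
import math
import random
import statistics


def classical_sd_of_median(s, density_at_median):
    """Large-sample formula: 1 / (2 * sqrt(s) * f(0))."""
    return 1.0 / (2.0 * math.sqrt(s) * density_at_median)


def simulated_sd_of_median(s, trials=10000, seed=1):
    """Standard deviation of the medians of `trials` samples of s unit-normal items."""
    rng = random.Random(seed)
    medians = [
        statistics.median([rng.gauss(0.0, 1.0) for _ in range(s)])
        for _ in range(trials)
    ]
    return statistics.pstdev(medians)


f0 = 1.0 / math.sqrt(2.0 * math.pi)  # unit normal ordinate at its median
ratios = {}
for s in (5, 101):
    approx = classical_sd_of_median(s, f0)
    empirical = simulated_sd_of_median(s)
    ratios[s] = empirical / approx  # a ratio near 1 means the formula is close
    print(s, round(approx, 4), round(empirical, 4), round(ratios[s], 3))
```

For the normal curve the discrepancy at small $s$ is moderate; the text indicates that for other types of frequency curves it may be considerably larger.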

Since it became obvious that the relative stability of the medi- 
ans and arithmetic means of small samples cannot be determined by 
the methods which are valid when the samples are large, this paper 
resolved itself into two distinct investigations: a treatment of 
certain frequency functions using the classical formula for the 
standard deviation of the median, valid for large values of $s$;
and the development of a second method of comparing the stability
of the arithmetic mean and median which may be applied also when
$s$ is small. The first of these topics is considered in sections II
and III, the second is taken up in sections IV and V, and is mathematically
the more interesting part of the work. In section VI
the various methods of comparing the stability of the arithmetic
mean and median are applied to a particular sequence of economic
data.


II.


THE RELATIVE MAGNITUDE AND STABILITY OF THE ARITHMETIC
MEAN OF A FREQUENCY DISTRIBUTION WHICH IS COMPOSED
OF TWO NORMAL DISTRIBUTIONS.


1. The Mean and Median and Their Standard Deviations.
In this section a study will be made of the frequency function
whose equation is

$$y = \frac{c_1}{\sigma_1\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma_1^2}} + \frac{c_2}{\sigma_2\sqrt{2\pi}}\,e^{-\frac{(x-b)^2}{2\sigma_2^2}}, \tag{1}$$

with the purpose of determining the influence of the five parameters
of this equation upon the location of the mean and median of the
distribution, and upon the standard deviations of these averages.

The only conditions imposed upon the parameters $c_1$, $c_2$, $\sigma_1$,
$\sigma_2$, $b$ are that they shall assume only positive values (since they
represent, respectively, the areas of the two component curves, their
standard deviations, and the distance between their arithmetic
means), and that the first two parameters shall satisfy the equation

$$c_1 + c_2 = 1, \tag{2}$$

so that the total probability, as represented by $\int_{-\infty}^{\infty} y\,dx$, shall
be unity.
The arithmetic mean, $\bar{X}$, of the distribution may be expressed
as a function of the parameters by the equation

$$\bar{X} = c_2 b. \tag{3}$$
The median, $M$, of the distribution satisfies the equation

$$\frac{c_1}{\sigma_1\sqrt{2\pi}}\int_{-\infty}^{M} e^{-\frac{x^2}{2\sigma_1^2}}\,dx + \frac{c_2}{\sigma_2\sqrt{2\pi}}\int_{-\infty}^{M} e^{-\frac{(x-b)^2}{2\sigma_2^2}}\,dx = \frac{1}{2}, \tag{4}$$

and can in general be located only by interpolation in a table of
areas under the normal curve. This interpolation can be more easily
performed if equation (4) is transformed into

$$c_1\int_{0}^{M/\sigma_1} e^{-\frac{x^2}{2}}\,dx = c_2\int_{0}^{(b-M)/\sigma_2} e^{-\frac{x^2}{2}}\,dx. \tag{5}$$

In distribution (1), $\sigma_1$ and $\sigma_2$ denote the standard deviations
of the component distributions, measured from the means of the
respective components. Hence, the standard deviation, $\sigma$, of the
entire distribution satisfies the equation

$$\sigma^2 = c_1\left(\sigma_1^2 + \bar{X}^2\right) + c_2\left(\sigma_2^2 + (b - \bar{X})^2\right).$$

Therefore the value of the standard deviation of the arithmetic
means of samples containing $s$ items each drawn from distribution
(1) is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{s}}. \tag{6}$$

If we assume that $s$, the number of items in the sample, is
sufficiently large to justify its use, an approximation to the standard
deviation of the median may be obtained from the equation

$$\sigma_M = \frac{1}{2\,y_M\sqrt{s}}, \quad\text{where}\quad y_M = \frac{c_1}{\sigma_1\sqrt{2\pi}}\,e^{-\frac{M^2}{2\sigma_1^2}} + \frac{c_2}{\sigma_2\sqrt{2\pi}}\,e^{-\frac{(M-b)^2}{2\sigma_2^2}}. \tag{7}$$

2. The Relative Magnitude of the Median and Mean.

From equations (3) and (5) it is seen that, if four of the five
parameters, $c_1$, $c_2$, $\sigma_1$, $\sigma_2$, $b$, are fixed and the fifth is allowed
to vary, both $\bar{X}$ and $M$ will be monotone increasing functions of
$b$ and of $c_2$, and monotone decreasing functions of $c_1$, and that
$\bar{X}$ is independent of the standard deviations of both components,
while $M$ is a monotone increasing function of $\sigma_1$ and a monotone
decreasing function of $\sigma_2$.










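When $c_1 = c_2$ and $\sigma_1 = \sigma_2$,
distribution (1) becomes symmetrical, and

$$\bar{X} = M = \frac{b}{2}.$$

To obtain conditions under which $\bar{X}$ shall exceed $M$, let equations
(3) and (5) be differentiated with respect to $b$. The inequality

$$\frac{d\bar{X}}{db} > \frac{dM}{db}$$

may be reduced to the form

$$\frac{1}{\sigma_1}\,e^{-\frac{M^2}{2\sigma_1^2}} > \frac{1}{\sigma_2}\,e^{-\frac{(b-M)^2}{2\sigma_2^2}}.$$

It follows from equation (5) that when $c_1 > c_2$,

$$\frac{b-M}{\sigma_2} > \frac{M}{\sigma_1}, \quad\text{whence}\quad e^{-\frac{M^2}{2\sigma_1^2}} > e^{-\frac{(b-M)^2}{2\sigma_2^2}}.$$

Hence the inequalities

$$c_1 > c_2, \qquad \sigma_1 < \sigma_2 \tag{8}$$

are a sufficient condition that $d\bar{X}/db$ shall exceed $dM/db$, and since
$\bar{X} = M = 0$ when $b = 0$, inequalities (8) are sufficient to insure
that for positive values of $b$, $\bar{X}$ will exceed $M$.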
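
The reduction of the inequality is stated above without its intermediate steps. One way of supplying them is sketched below, taking equation (3) as $\bar{X} = c_2 b$ and equation (5) as $c_1\int_0^{M/\sigma_1} e^{-x^2/2}\,dx = c_2\int_0^{(b-M)/\sigma_2} e^{-x^2/2}\,dx$, with $c_1 + c_2 = 1$. The abbreviations $A$ and $B$ are introduced here for brevity and are not the author's notation.

```latex
\begin{align*}
\frac{d\bar{X}}{db} &= c_2, \\
\frac{c_1}{\sigma_1}\,e^{-M^2/2\sigma_1^2}\,\frac{dM}{db}
  &= \frac{c_2}{\sigma_2}\,e^{-(b-M)^2/2\sigma_2^2}\left(1 - \frac{dM}{db}\right)
  && \text{(differentiating (5) with respect to } b\text{)}, \\
\frac{dM}{db} &= \frac{c_2 B}{c_1 A + c_2 B},
  \qquad A = \frac{1}{\sigma_1}\,e^{-M^2/2\sigma_1^2}, \quad
         B = \frac{1}{\sigma_2}\,e^{-(b-M)^2/2\sigma_2^2}, \\
\frac{d\bar{X}}{db} > \frac{dM}{db}
  &\iff c_1 A + c_2 B > B
  \iff c_1 A > (1 - c_2)\,B = c_1 B
  \iff A > B .
\end{align*}
```

The last inequality, $A > B$, is the reduced form; it holds whenever the exponential factor of $A$ exceeds that of $B$ and $\sigma_1$ does not exceed $\sigma_2$.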

In the case of many frequency distributions whose form suggests
dissection into two normal components it is found that the
standard deviation of the smaller component exceeds that of the
larger component. Hence, condition (8) is fulfilled, and $\bar{X}$ differs
more from the mean of the larger component than does $M$.


3. Relative Stability of Median and Mean for the Special Case, $b = 0$.


From equations (6) and (7) it is seen that while $\sigma_{\bar{X}}$ and $\sigma_M$
are both monotone increasing functions of $b$, they do not possess
a monotone character with respect to the other parameters of equation
(1). The development of general conditions which the parameters
must satisfy in order that $\sigma_{\bar{X}}$ may exceed $\sigma_M$ is impeded by
the fact that $M$ is defined in (5) by an equation containing integrals,
and its numerical value, for given values of the parameters,
can be obtained only by interpolation in a table of areas under the
normal curve. We shall therefore determine the relative stability
of the median and arithmetic mean, as measured by the standard
deviations of these averages, for certain special cases of distribution
(1).

If, in equation (1), $b$ is assigned the value zero, the distribution
becomes symmetrical and $\bar{X} = M = 0$. Hence the condition
for equal stability of median and arithmetic mean, $\sigma_{\bar{X}} = \sigma_M$, may
in this special case be written

$$\sqrt{c_1\sigma_1^2 + c_2\sigma_2^2} = \sqrt{\frac{\pi}{2}}\;\frac{\sigma_1\sigma_2}{c_1\sigma_2 + c_2\sigma_1}.$$

Letting the ratio, $\sigma_2/\sigma_1$, be denoted by $\rho$, we obtain

$$f(\rho) = c_1^2 c_2\,\rho^4 + 2c_1 c_2^2\,\rho^3 + \left(c_1^3 + c_2^3 - \frac{\pi}{2}\right)\rho^2 + 2c_1^2 c_2\,\rho + c_1 c_2^2 = 0. \tag{9}$$

This fourth degree equation in $\rho$ possesses two positive real roots,
whatever the values of $c_1$ and $c_2$, for $f(0)$ and $f(\infty)$ are both positive,
while $f(1) = (c_1 + c_2)^3 - \frac{\pi}{2} = 1 - \frac{\pi}{2} < 0$.
Hence there exist two values, $\rho_1 < 1$ and $\rho_2 > 1$, such that when $\rho$
assumes either of these values the standard deviations of the arithmetic
mean and median are equal. For values of $\rho$ in the interval
($\rho_1 < \rho < \rho_2$), the standard deviation of the arithmetic mean is less
than that of the median. For values of $\rho$ outside this interval,
the standard deviation of the arithmetic mean is greater than that
of the median. Hence it is seen that, for $b = 0$, the relative stability
of the median and arithmetic mean of distribution (1) is determined
by the ratio of the standard deviations of the two component curves.
Yule⁸ has discussed the relative stability of the median and
arithmetic mean of distribution (1) when, in addition to the condition
$b = 0$, the distribution is subjected to the further restriction

$$c_1 = c_2 = 0.5,$$

and has obtained the numerical values of $\rho$ for which the two
averages will possess equal standard deviations:

$$\rho_1 = 0.4472, \qquad \rho_2 = 2.2360.$$


8 Yule, G. U., An Introduction to the Theory of Statistics, 8th ed.,
p. 339.
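
The two values quoted from Yule can be checked numerically. The sketch below assumes that the condition of equal stability for $b = 0$ takes the form $(c_1 + c_2\rho^2)(c_1\rho + c_2)^2 = \tfrac{\pi}{2}\rho^2$ with $\rho = \sigma_2/\sigma_1$, which is the expanded quartic written out in the function; the root-finder and its bracketing intervals are choices made here for illustration.

```python
import math


def f(rho, c1=0.5, c2=0.5):
    """Quartic whose positive roots give equal stability of mean and median (b = 0)."""
    return (c1 * c1 * c2 * rho**4
            + 2 * c1 * c2 * c2 * rho**3
            + (c1**3 + c2**3 - math.pi / 2) * rho**2
            + 2 * c1 * c1 * c2 * rho
            + c1 * c2 * c2)


def bisect(func, lo, hi, steps=100):
    """Root of func in [lo, hi], assuming func changes sign over the interval."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if (func(lo) < 0) == (func(mid) < 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rho_1 = bisect(f, 0.0, 1.0)   # f(0) > 0 and f(1) < 0
rho_2 = bisect(f, 1.0, 10.0)  # f(1) < 0 and f(10) > 0
print(round(rho_1, 4), round(rho_2, 4))
```

With equal areas the quartic reads the same forwards and backwards, so the two roots are reciprocals of one another, in agreement with the quoted figures.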


4. Relative Stability of Median and Mean for the Special Case, $c_1 = c_2$.


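Now let the restriction $b = 0$ be removed. Let $b$ assume
any positive value, and let the condition $c_1 = c_2 = 0.5$
be imposed. The upper limits of the integrals in equation (5) will
then be equal, whence

$$M = \frac{b\,\sigma_1}{\sigma_1 + \sigma_2}, \qquad \bar{X} = \frac{b}{2},$$

$$\sigma_M = \frac{\sqrt{2\pi}\,\sigma_1\sigma_2}{\sqrt{s}\,(\sigma_1 + \sigma_2)}\,e^{\frac{b^2}{2(\sigma_1+\sigma_2)^2}}, \qquad \sigma_{\bar{X}} = \frac{\sqrt{2\sigma_1^2 + 2\sigma_2^2 + b^2}}{2\sqrt{s}}.$$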


The relative magnitude of the median and mean is seen to
depend upon the standard deviations of the component distributions,
and $\bar{X}$ is greater than, equal to, or less than $M$ according as
$\rho = \sigma_2/\sigma_1$ is greater than, equal to, or less than unity.

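To obtain conditions for equal stability in the two averages,
let $\sigma_{\bar{X}}$ be set equal to $\sigma_M$. By introducing the notation,

$$\rho = \frac{\sigma_2}{\sigma_1}, \qquad k = \frac{b}{\sigma_1 + \sigma_2},$$

this equation may be reduced to the form

$$(k^2+2)\rho^4 + (4k^2+4)\rho^3 + \left(6k^2 + 4 - 8\pi e^{k^2}\right)\rho^2 + (4k^2+4)\rho + (k^2+2) = 0. \tag{10}$$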


Taking A=(P+%) as a new variable, this equation may be 
written as the quadratic 


a 


» “ * 
(A442) A + Gh s4)A +4R-67e =0, 


whose roots, are both real for all values of #*. Furthermore, since 
re’ s 2(%+:) for all values of ¥ *, one of these roots is positive 
and greater than 2, and therefore has a value which A = (7+ //o) 
may assume, , 

Hence, for all values of $k^2$ (and therefore for all values of
$b$) there exist two reciprocal values of $\rho$, ($\rho_1$ and $\rho_2$), such
that when $\rho$ assumes either of these values the standard deviations
of the arithmetic mean and median are equal. For values of $\rho$ in
the interval $(\rho_1 < \rho < \rho_2)$, the standard deviation of the arithmetic
mean is less than that of the median, and for values of $\rho$ not in
this interval, the standard deviation of the arithmetic mean is
greater than that of the median.

Yule's results show that when $b = 0$, $\rho_1 = 0.4472$ and $\rho_2 = 2.2360$,
and therefore that the mean and median are equally stable when
the standard deviation of one component is approximately 2.25
times that of the other. It remains to investigate the behavior of
the interval $(\rho_1 < \rho < \rho_2)$ as $b$ varies.

Since it is the ratio of the standard deviations of the component
curves, and not their actual values, which determines this interval,
suppose the unit of measurement to be so chosen that $(\sigma_1 + \sigma_2) = 1$,
whence $k = b$, and

$\lambda = \frac{-2(b^2+1) + 2\sqrt{2\pi e^{b^2}(b^2+2) + 1}}{b^2+2}$.

Then since $\pi e^{b^2} > 2$ for all values of $b^2$, $\lambda$ may be shown
to be a monotone increasing function of $b^2$, and therefore a monotone
increasing function of $b$ (positive). Since, furthermore, $\lambda$
is an increasing function of $\rho$ (when $\rho > 1$), it appears that the
interval $(\rho_1 < \rho < \rho_2)$ in which the standard deviation of the
arithmetic mean is less than that of the median (i.e., the interval in
which the mean is the more stable average), becomes larger as
the size of $b$ is increased.

Summarizing, for the special case of distribution (1) in which
$c_1 = c_2$, (i.e., in which the areas of the two component normal
curves are equal), the relative stability of the median and mean
depends upon the value of $\rho$, the ratio of the standard deviations
of the two component curves, and upon the value of $b$, the distance
between their means. When $\rho$ equals one, the mean is the more
stable average, independent of $b$. Furthermore, for all positive
values of $b$ there exists an interval of values of $\rho$, including
$\rho = 1$, within which the mean is more stable, at the end points of
which the averages are equally stable, and without which the median
is more stable. When $b = 0$, this interval is $(0.4472 < \rho < 2.2360)$,
and as $b$ increases the interval becomes larger.
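
The end points of this interval can be computed directly from the quadratic in $\lambda = \rho + 1/\rho$. The following is a minimal sketch, not part of the original paper; the function name `equal_stability_ratios` is hypothetical, and it assumes equal component areas with $k = b/(\sigma_1 + \sigma_2)$ as defined above.

```python
import math

def equal_stability_ratios(k):
    """Return (rho1, rho2) at which mean and median are equally stable.

    k is b / (sigma1 + sigma2); the two component areas are assumed equal.
    """
    k2 = k * k
    # Quadratic (k^2+2) L^2 + (4k^2+4) L + 4k^2 - 8*pi*e^{k^2} = 0 in L = rho + 1/rho.
    a = k2 + 2.0
    b = 4.0 * k2 + 4.0
    c = 4.0 * k2 - 8.0 * math.pi * math.exp(k2)
    lam = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # the root exceeding 2
    # rho solves rho^2 - lam*rho + 1 = 0, so the two roots are reciprocal.
    root = math.sqrt(lam * lam - 4.0)
    return (lam - root) / 2.0, (lam + root) / 2.0

for k in (0.0, 0.5, 1.0):
    rho1, rho2 = equal_stability_ratios(k)
    print(f"k = {k:.1f}: rho1 = {rho1:.4f}, rho2 = {rho2:.4f}")
```

With `k = 0` the sketch should reproduce Yule's pair of values, and larger `k` should widen the interval, in line with the summary above.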
It was stated at the beginning of this section that, on account
of the approximation to the value of $\sigma_M$ which is used, the
conclusions will apply to distributions containing a large number of items.
It should be noted that, in the special case which has just been
considered, to assign a large value to $b$ may cause the median
to fall at a point of relatively small frequency, in which case, as
will be shown in section IV, the approximation to the standard
deviation of the median, $\sigma_M = 1/(2 y_M \sqrt{s})$, will exceed its true
value, and the superior stability of the arithmetic mean, as obtained
from equation (10), may be exaggerated. In such cases the probable
errors of the averages should be computed by the method of
section V.

5. Relative Stability of Median and Mean for the General Distribution (1).

Finally, let the restriction $c_1 = c_2$ be removed from distribution (1).
The condition that the standard deviations of the median
and mean of this general distribution shall be equal may be written


7 G, Oz 
2 z 2 T ¢ Va 
4, 6 +2,6 +466 -|3 a <n) 
i i e 2a 


Let the notation 


Le 


|x 


Gy 2 z= 
P= G@, F#VTG , g-e me 7% 


4 


> 
be introduced. The parameters 7 ,#, #21 may thus assume only 
positive values, and $g$ and $m$ are not greater than unity. Then
fy) c's. ge +(1 ZC, ag ™ tog apt (G'm'+26,°6,'¢ » hcg =z) yo" 
+(4 4 mhz 6% gm)P re Cm’ = O. 
This equation may possess two real positive roots. As $\rho$
approaches zero or positive infinity, it is seen that $f(\rho)$ becomes
positive, independent of $c_1$ and $c_2$, and therefore that the standard
deviation of the mean exceeds that of the median. If the equation
possesses two distinct positive roots, there will be an interval of
positive values of $\rho$ for which the standard deviation of the median
will exceed that of the mean. However, this interval does not
necessarily contain the value, $\rho = 1$, as in the special case where
$c_1 = c_2$.
III.
THE RELATIVE STABILITY OF THE MEDIAN AND ARITHMETIC
MEAN OF A SYMMETRICAL FREQUENCY DISTRIBUTION WHICH
IS COMPOSED OF THREE NORMAL DISTRIBUTIONS.


1. The Relative Magnitude of the Standard Deviations of the
Median and Mean.

It will be the purpose of this section to investigate the relative 
stability of the median and arithmetic mean of a symmetrical 
frequency distribution which may be dissected into three normal 
distributions, two of which possess equal areas and equal standard 
deviations and whose means are translated equal distances to 
left and right, respectively, of the mean of the third distribution. 

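The equation of the frequency function which describes such
a distribution is of the form

(1)  $y = \frac{c_1}{\sigma_1\sqrt{2\pi}}\, e^{-x^2/2\sigma_1^2} + \frac{c_2}{\sigma_2\sqrt{2\pi}} \left[ e^{-(x-b)^2/2\sigma_2^2} + e^{-(x+b)^2/2\sigma_2^2} \right]$,

where the areas of the components are connected by the relation
$c_1 + 2c_2 = 1$.

Since the distribution is symmetrical with respect to the
y-axis, both the median and mean fall at the origin, and, if the
approximation, $\sigma_M = 1/(2 y_M \sqrt{s})$, is used, the standard deviations
of the median and mean are readily expressed in terms of the
parameters of equation (1):

$\sigma_M = \frac{\sqrt{2\pi}\,\sigma_1\sigma_2}{2\sqrt{s}\,\left[c_1\sigma_2 + 2c_2\sigma_1 e^{-b^2/2\sigma_2^2}\right]}$,

$\sigma_{\bar{x}} = \frac{\sqrt{c_1\sigma_1^2 + 2c_2(\sigma_2^2 + b^2)}}{\sqrt{s}}$.
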

To obtain conditions under which the two averages will be
equally stable, let the notation,

$\rho = \sigma_2/\sigma_1$,  $r = b/\sigma_2$,

be employed and let the standard deviations of median and mean
be set equal to each other, whence we obtain the equation

$1 + 2c_2[\rho^2(1+r^2) - 1] = \frac{\pi\rho^2}{2\left[\rho + 2c_2(e^{-r^2/2} - \rho)\right]^2}$,

which, if we let

$(e^{-r^2/2} - \rho) = g$  and  $[\rho^2(1+r^2) - 1] = h$,

may be written

(2)  $f(c_2) = 16g^2h\,c_2^3 + 8g(2\rho h + g)\,c_2^2 + 4\rho(\rho h + 2g)\,c_2 + \rho^2(2 - \pi) = 0$.

Since $c_1 + 2c_2 = 1$, only the positive real roots of equation
(2) which are less than 0.5 are of interest. Independent of the
positive value assigned to $\rho$ and $r$, this equation may have not
more than two real roots in this interval, for both $f(0)$ and
$f(0.5)$ are less than zero when $b \neq 0$.

If equation (2) possesses two real, distinct, positive roots
less than $c_2 = 0.5$, then there will be a subinterval of the interval
$(0 < c_2 < 0.5)$ within which the standard deviation of the
arithmetic mean will exceed the standard deviation of the median,
at the end points of which the averages will be equally stable,
and without which the standard deviation of the median will
exceed that of the arithmetic mean. If the equation has no real
roots in this interval, the standard deviation of the median will
exceed that of the arithmetic mean throughout the interval.

The tangents to the curve whose equation is (2) are
horizontal when $c_2 = -\rho/2g$ and when $c_2 = (-\rho h - 2g)/6gh$. Since
$f(-\rho/2g)$ is negative for all values of $\rho$ other than zero, $-\rho/2g$
is a value of $c_2$ for which the standard deviation of the median
is greater than the standard deviation of the arithmetic mean.
If there exists an interval of values of $c_2$ for which the standard
deviation of the arithmetic mean is greater than the standard
deviation of the median, it will contain the value $c_2 = (-\rho h - 2g)/6gh$.
Therefore the condition that such an interval exist is

$0 < \frac{-\rho h - 2g}{6gh} < 0.5$,  $f\left(\frac{-\rho h - 2g}{6gh}\right) > 0$.

If the value, $c_2 = (-\rho h - 2g)/6gh$, lies in the interval $(0 < c_2 < 0.5)$
and if $f[(-\rho h - 2g)/6gh] = 0$, then equation (2) will possess a
double root, and the standard deviation of the median will equal
the standard deviation of the arithmetic mean for a single value
of $c_2$, $c_2 = (-\rho h - 2g)/6gh$, and will exceed it for all other values
of $c_2$. The condition under which $f[(-\rho h - 2g)/6gh]$ will vanish
is that $g/\rho h$ assume one of the values: 4.67284, −1.53327,
−0.13957, for letting $c_2 = (-\rho h - 2g)/6gh$, equation (2) becomes

(3)  $8(g - \rho h)^3 - 27\pi\rho^2h^2g = 0$,

which may be written as a cubic equation in $g/\rho h$, whose roots
have the above values.
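
The three quoted values can be checked numerically. Dividing equation (3) by $(\rho h)^3$ and writing $t = g/\rho h$ gives $8(t-1)^3 - 27\pi t = 0$. The sketch below is an illustration only (the helper names `cubic` and `bisect` and the bracketing intervals are chosen here, not taken from the paper); it locates the roots by plain bisection.

```python
import math

def cubic(t):
    # Equation (3) divided by (rho*h)^3, with t = g / (rho*h).
    return 8.0 * (t - 1.0) ** 3 - 27.0 * math.pi * t

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid   # root lies in the upper half
        else:
            hi = mid              # root lies in the lower half
    return 0.5 * (lo + hi)

# Brackets chosen by inspecting the sign of cubic(t) at a few points.
brackets = ((-3.0, -1.0), (-1.0, 0.0), (4.0, 5.0))
roots = [bisect(cubic, lo, hi) for lo, hi in brackets]
print([round(r, 5) for r in roots])
```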


2. The Dissection of a Frequency Distribution into Three Normal
Components.

In order to apply the above conclusions in determining the 
relative stability of the averages of a particular sequence of eco- 
nomic data, it is necessary that the data be dissected into three 
normal distributions. A general method of determining the values 
of the five parameters of equation (1) from given frequency 
data will therefore be developed. 

Karl Pearson* has described a method for dissecting an
asymmetrical frequency curve into two normal curves. He obtains 
expressions for the first five moments of the curve, which he 
solves, after lengthy algebraic manipulation, for the parameters. 
A similar procedure, the solution of moment equations, may be 
applied to a dissection into three normal curves. However, since 
the distribution has been assumed to be symmetrical, expressions 
for the odd moments vanish identically. Hence it is necessary 
to use moments as high as the eighth in order to obtain five equa- 
tions from which the values of the parameters may be determined. 
While Pearson’s method of setting up the moment equations may 
be used, his method of solution will not carry over to this case. 


* Loc. cit. 
Given a frequency distribution of the variable $x$ whose origin
has been chosen at the arithmetic mean of the distribution, let
$m_n$ denote the $n$th moment of the distribution, and let $m_n$ be
set equal to the corresponding moment of the theoretical
distribution whose equation is (1). We have, then, as the equations
from which the five parameters of distribution (1) may be
determined:

(4)  $c_1 + 2c_2 = 1$,
     $c_1\sigma_1^2 + 2c_2(\sigma_2^2 + b^2) = m_2$,
     $3c_1\sigma_1^4 + 2c_2(3\sigma_2^4 + 6b^2\sigma_2^2 + b^4) = m_4$,
     $15c_1\sigma_1^6 + 2c_2(15\sigma_2^6 + 45b^2\sigma_2^4 + 15b^4\sigma_2^2 + b^6) = m_6$,
     $105c_1\sigma_1^8 + 2c_2(105\sigma_2^8 + 420b^2\sigma_2^6 + 210b^4\sigma_2^4 + 28b^6\sigma_2^2 + b^8) = m_8$.
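
As a check on the left members of equations (4), the even moments of the three-component distribution can be evaluated for any trial set of parameters. This is a small sketch under the notation above; the function name `mixture_moments` is hypothetical.

```python
def mixture_moments(c1, c2, s1, s2, b):
    """Even moments (2nd, 4th, 6th, 8th) about the origin for distribution (1).

    c1 is the area of the central curve and c2 the area of each outer curve,
    so that c1 + 2*c2 = 1; s1 and s2 are the standard deviations of the
    central and outer curves; b is the shift of the outer means.
    """
    m2 = c1 * s1**2 + 2 * c2 * (s2**2 + b**2)
    m4 = 3 * c1 * s1**4 + 2 * c2 * (3 * s2**4 + 6 * b**2 * s2**2 + b**4)
    m6 = 15 * c1 * s1**6 + 2 * c2 * (
        15 * s2**6 + 45 * b**2 * s2**4 + 15 * b**4 * s2**2 + b**6)
    m8 = 105 * c1 * s1**8 + 2 * c2 * (
        105 * s2**8 + 420 * b**2 * s2**6 + 210 * b**4 * s2**4
        + 28 * b**6 * s2**2 + b**8)
    return m2, m4, m6, m8

# A single unit normal curve: the moments reduce to 1, 3, 15, 105.
print(mixture_moments(1.0, 0.0, 1.0, 1.0, 0.0))
```

Comparing these values with the observed moments gives a quick test of any proposed dissection before the corrections described later are applied.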

Instead of carrying through the solution of these five
equations, it has been found convenient to assign to $\sigma_1$ a value equal
to the standard deviation of a central group of items, and to
retain only the first four moment equations to be solved for the
other four parameters. Later the five equations will be used to
correct this estimated value of $\sigma_1$.

If we let $M_n$ denote the $n$th moment of the given
distribution with $\sigma_1$ as unit, denote $b/\sigma_2$ by $u$, $\sigma_2/\sigma_1$ by $\rho$, and
eliminate $c_2$ from these equations, we obtain

(5)  $c_1 + (1-c_1)\rho^2(1+u^2) = M_2$,
     $3c_1 + (1-c_1)\rho^4(3+6u^2+u^4) = M_4$,
     $15c_1 + (1-c_1)\rho^6(15+45u^2+15u^4+u^6) = M_6$.

Now eliminating $c_1$, this system of equations reduces to

$3[M_2 - \rho^2(1+u^2)] + (1-M_2)\rho^4(3+6u^2+u^4) = M_4[1 - \rho^2(1+u^2)]$,

$15[M_2 - \rho^2(1+u^2)] + (1-M_2)\rho^6(15+45u^2+15u^4+u^6) = M_6[1 - \rho^2(1+u^2)]$,

and when the notation $\alpha = (1-M_2)$, $\beta = (M_4-3)$, $\gamma = (3M_2-M_4) = -3\alpha-\beta$,
$\delta = (M_6-15)$, $\epsilon = (15M_2-M_6) = -15\alpha-\delta$,
is introduced and the equations are written in descending powers
of $\rho^2$ they become

(6)  $\alpha\rho^4(3+6u^2+u^4) + \beta\rho^2(1+u^2) + \gamma = 0$,
     $\alpha\rho^6(15+45u^2+15u^4+u^6) + \delta\rho^2(1+u^2) + \epsilon = 0$.

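Let Sylvester's method of elimination be applied to these
two equations, making use of the property that the resultant,
$R$, of the equations $a_0x^3 + a_1x^2 + a_2x + a_3 = 0$ and $b_0x^2 + b_1x + b_2 = 0$
is

$R = \begin{vmatrix} (a_0b_2) & (a_1b_2) - a_3b_0 & -a_3b_1 \\ (a_0b_1) & (a_0b_2) & -a_3b_0 \\ b_0 & b_1 & b_2 \end{vmatrix}$,

where $(a_ib_j)$ denotes $a_ib_j - a_jb_i$.*

Let $(1+u^2) = f_1(u^2)$, $(3+6u^2+u^4) = f_2(u^2)$, $(15+45u^2+15u^4+u^6) = f_3(u^2)$. Then

$\begin{vmatrix} \alpha\gamma f_3 - \alpha\delta f_1f_2 & -\beta\delta f_1^2 - \alpha\epsilon f_2 & -\beta\epsilon f_1 \\ \alpha\beta f_1f_3 & \alpha\gamma f_3 - \alpha\delta f_1f_2 & -\alpha\epsilon f_2 \\ \alpha f_2 & \beta f_1 & \gamma \end{vmatrix} = 0$,

and expanding and simplifying this determinant we obtain

$(3\alpha\beta\gamma\epsilon - 2\alpha\gamma^2\delta) f_1f_2f_3 + (\alpha\gamma\delta^2 - \alpha\beta\delta\epsilon) f_1^2f_2^2 + (\beta^2\gamma\delta - \beta^3\epsilon) f_1^3f_3 + \alpha\gamma^3 f_3^2 + \alpha^2\epsilon^2 f_2^3 = 0$.

Since $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$ are constants, this equation may be written

$A f_1f_2f_3 + B f_1^2f_2^2 + C f_1^3f_3 + D f_3^2 + E f_2^3 = 0$.

If, finally, $f_1$, $f_2$, and $f_3$ are replaced by their values as functions
of $u^2$, this equation reduces to

(7)  $u^{12}(A + B + C + D + E)$
     $+ u^{10}(22A + 14B + 18C + 30D + 18E)$
     $+ u^{8}(159A + 67B + 93C + 315D + 117E)$
     $+ u^{6}(468A + 132B + 196C + 1380D + 324E)$
     $+ u^{4}(555A + 123B + 195C + 2475D + 351E)$
     $+ u^{2}(270A + 54B + 90C + 1350D + 162E)$
     $+ (45A + 9B + 15C + 225D + 27E) = 0$,


* Dickson, L.E., First Course in Theory of Equations, p. 150.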
a sixth degree equation in $u^2$ upon which the complete solution
of the problem now turns, for having obtained a value of $u^2$
from equation (7), the values of $\rho^2$ and $c_1$ may be determined
from equations (6) and (5), respectively. Since $u = b/\sigma_2$,
$\rho = \sigma_2/\sigma_1$, and $c_1 + 2c_2 = 1$, the parameters of equation (1)
may be obtained.
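
The numerical coefficients of equation (7) follow mechanically from the products of $f_1$, $f_2$, $f_3$ regarded as polynomials in $v = u^2$. The sketch below is illustrative only (the helper names `poly_mul`, `poly_pow` and `equation7` are hypothetical); it rebuilds those coefficients by polynomial multiplication and assembles equation (7) from given values of $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, lowest power first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_pow(p, n):
    """Raise a polynomial to a non-negative integer power."""
    out = [1]
    for _ in range(n):
        out = poly_mul(out, p)
    return out

# f1, f2, f3 as polynomials in v = u^2, lowest power first.
F1 = [1, 1]
F2 = [3, 6, 1]
F3 = [15, 45, 15, 1]

PRODUCTS = {
    "A": poly_mul(poly_mul(F1, F2), F3),               # f1 f2 f3
    "B": poly_mul(poly_pow(F1, 2), poly_pow(F2, 2)),   # f1^2 f2^2
    "C": poly_mul(poly_pow(F1, 3), F3),                # f1^3 f3
    "D": poly_pow(F3, 2),                              # f3^2
    "E": poly_pow(F2, 3),                              # f2^3
}

def equation7(alpha, beta, gamma, delta, eps):
    """Coefficients of equation (7) in v = u^2, lowest power first."""
    weights = {
        "A": 3 * alpha * beta * gamma * eps - 2 * alpha * gamma**2 * delta,
        "B": alpha * gamma * delta**2 - alpha * beta * delta * eps,
        "C": beta**2 * gamma * delta - beta**3 * eps,
        "D": alpha * gamma**3,
        "E": alpha**2 * eps**2,
    }
    return [sum(weights[key] * PRODUCTS[key][i] for key in "ABCDE")
            for i in range(7)]

for key in "ABCDE":
    print(key, PRODUCTS[key][::-1])   # highest power of u^2 first, as in (7)
```

Any standard polynomial root finder can then be applied to the seven coefficients returned by `equation7`; only positive real roots in $u^2$ are admissible.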

It will be recalled that equation (7) has been obtained from
the first four of equations (4), and that we have employed this
equation to obtain values of $c_1$, $c_2$, $\sigma_2$, $b$ corresponding to an
assigned value of $\sigma_1$. This estimated value of $\sigma_1$ may be
corrected, and corresponding corrections to the values of the other
four parameters may be obtained, by use of the five equations (4).

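Let equations (4) be written

$f_i(c_1, c_2, \sigma_1, \sigma_2, b) = 0$,  $(i = 1, 2, 3, 4, 5)$.

Let $c_1'$, $c_2'$, $\sigma_2'$, $b'$ denote the values which the four parameters
take on when $\sigma_1$ is assigned the value $\sigma_1'$. Let $\Delta c_1$, $\Delta c_2$, $\Delta\sigma_1$,
$\Delta\sigma_2$, $\Delta b$ denote the respective corrections which should be
applied. Then using Taylor's theorem and neglecting terms which
contain derivatives of higher order than the first, we obtain five
linear equations in the five corrections:

$f_i(c_1' + \Delta c_1,\ c_2' + \Delta c_2,\ \sigma_1' + \Delta\sigma_1,\ \sigma_2' + \Delta\sigma_2,\ b' + \Delta b)$
$= f_i(c_1', c_2', \sigma_1', \sigma_2', b') + \Delta c_1 \frac{\partial f_i}{\partial c_1} + \Delta c_2 \frac{\partial f_i}{\partial c_2} + \Delta\sigma_1 \frac{\partial f_i}{\partial \sigma_1} + \Delta\sigma_2 \frac{\partial f_i}{\partial \sigma_2} + \Delta b \frac{\partial f_i}{\partial b} = 0$,
$(i = 1, 2, 3, 4, 5)$.

The corrected values of the parameters, $c_1' + \Delta c_1$, $c_2' + \Delta c_2$,
$\sigma_1' + \Delta\sigma_1$, $\sigma_2' + \Delta\sigma_2$, $b' + \Delta b$ may be regarded as second
approximations to their true values, and further approximations may
be obtained in the same fashion.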


IV. 
THE STANDARD DEVIATION OF THE MEDIANS OF SMALL SAMPLES. 


1. The Classical Approximation to the Standard Deviation of 
the Median. 
In the preceding sections an approximation to the standard
deviation of the median has been used, and the conclusions have
been assumed to be valid only when $s$, the number of items in
the sample, is large. We wish, in the present section, to examine
this approximation, and to compare the results which it produces
with those obtained when other methods of determining the
standard deviation of the median are employed.

The formula ordinarily used to compute the standard
deviation of the medians of samples of $s$ items each, drawn from a
frequency distribution whose equation is $y = f(x)$ and which
satisfies the condition

(1)  $\int_{-\infty}^{0} f(x)\,dx = 0.5 = \int_{0}^{\infty} f(x)\,dx$,

is

(2)  $\sigma_M = \frac{1}{2 f(0) \sqrt{s}}$.*

That this formula gives only an approximation to the true
value of the standard deviation of the median and that the
approximation may be rather poor for distributions of certain types
is clear from the following derivation of the formula.

Let samples containing $s$ items each be drawn from the
distribution $y = f(x)$ which satisfies condition (1). Let the
proportion of items above $x = 0$ in each sample be denoted by
$(0.5 + d)$. These observed values will tend to cluster around
0.5 as a mean, with a standard deviation of $1/(2\sqrt{s})$. Let the
deviation of the median of a sample from the median of the
theoretical distribution, $x = 0$, be denoted by $e$. Then if the
number of items in the sample is sufficiently large to justify us
in assuming that $d$ is so small that we may regard the element
of the frequency curve whose base is the interval $(0, e)$, and
whose area is $d$, as approximately a rectangle, we may write

$e = \frac{d}{f(0)}$,  whence  $\sigma_M = \frac{\sigma_d}{f(0)} = \frac{1}{2 f(0) \sqrt{s}}$.


* Rietz, H.L., Mathematical Statistics, Carus Monograph III, p. 134.
Yule, G. U., Introduction to the Theory of Statistics, 8th ed., p. 337.
This replacement of an element of a frequency curve by a
rectangle can be justified only when $e$, the deviation of the
median of the sample from the median of the theoretical
distribution, is small. Hence there is reason to doubt whether formula
(2) will give a close approximation to the value of the standard
deviation of the median of samples which do not contain a large
number of items. The formula would seem particularly
untrustworthy when applied to a theoretical distribution in which the
median falls at a point of relatively small frequency.

An expression for the standard deviation of the median
which is not liable to the inaccuracies of approximation (2) may
be derived as follows. Given the frequency function $y = f(x)$
which satisfies condition (1), if a sample of $(2n+1)$ items is
drawn from this distribution, the probability that an item will
fall in the interval $(x, x+dx)$ approaches the limit $y\,dx$ as $dx$
approaches zero, the probability that an item will fall below $x$
is $\int_{-\infty}^{x} f(x)\,dx$, and the probability that an item will fall above
$x$ is $\int_{x}^{\infty} f(x)\,dx$. Hence the limit, as $dx$ approaches zero, of
the probability that the median of the sample will fall in the
interval $(x, x+dx)$ is

$(2n+1)\binom{2n}{n} \left[\int_{-\infty}^{x} f(x)\,dx\right]^n \left[\int_{x}^{\infty} f(x)\,dx\right]^n f(x)\,dx$,

and the square of the standard deviation of the median may be
obtained from the equation

(3)  $\sigma_M^2 = (2n+1)\binom{2n}{n} \int_{-\infty}^{\infty} x^2 f(x) \left[\int_{-\infty}^{x} f(x)\,dx\right]^n \left[\int_{x}^{\infty} f(x)\,dx\right]^n dx$.


The integrations involved in this equation may be difficult
to perform unless $f(x)$ is a simple function. Hence, we consider
the rectangular distribution whose equations are

$f(x) = 1$, $(-0.5 \le x \le 0.5)$;  $f(x) = 0$, $(x < -0.5)$, $(x > 0.5)$,

and obtain

$\sigma_M^2 = (2n+1)\binom{2n}{n} \int_{-0.5}^{0.5} x^2 (0.25 - x^2)^n\,dx = \frac{1}{4(2n+3)}$.

If we denote by $\sigma_M'$ the approximation to the value of the
standard deviation of the median obtained using formula (2),
we have for this distribution

$\sigma_M' = \frac{1}{2 f(0) \sqrt{2n+1}} = \frac{1}{2\sqrt{2n+1}}$,

whence we have the relation

$\sigma_M = \sigma_M' \sqrt{\frac{2n+1}{2n+3}}$.

It is observed that the approximation, $\sigma_M'$, exceeds the true
value, $\sigma_M$, for all values of $n$, but that the error factor approaches
unity as $n$ increases, and is close to unity even for fairly small
values of $n$.
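
The closed form for the rectangular distribution can be confirmed by evaluating equation (3) numerically. The following sketch is not from the paper; the function name `exact_variance_rectangular` and the use of Simpson's rule are choices made here for illustration.

```python
import math

def exact_variance_rectangular(n, steps=20000):
    """Equation (3) for the rectangular distribution on (-0.5, 0.5),
    evaluated with composite Simpson's rule (steps must be even)."""
    coeff = (2 * n + 1) * math.comb(2 * n, n)
    h = 1.0 / steps
    total = 0.0
    for j in range(steps + 1):
        x = -0.5 + j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * x * x * (0.25 - x * x) ** n
    return coeff * total * h / 3.0

for n in (1, 3, 25):
    exact = math.sqrt(exact_variance_rectangular(n))
    closed = 1.0 / (2.0 * math.sqrt(2 * n + 3))      # from 1 / (4(2n+3))
    classical = 1.0 / (2.0 * math.sqrt(2 * n + 1))   # formula (2) with f(0) = 1
    print(2 * n + 1, round(exact, 5), round(closed, 5), round(classical, 5))
```

The first two printed columns should agree, while the third (the classical approximation) should be slightly larger, with the gap shrinking as the sample size grows.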


2. A General Method of Obtaining Upper and Lower Limits 
of the Standard Deviation of the Median. 

For distributions composed of two normal components the 
integrations involved in equation (3) can be performed only ap- 
proximately, and this equation will serve only to determine upper 
and lower limits of the true value of the standard deviation of 
the median. A more straightforward method of obtaining these 
upper and lower limits, and one which is applicable to any 
frequency distribution, will be followed. 

Let $x_i$ denote the deviation of the $i$th percentile of a
distribution from the median, and let $p_i$ denote the probability that
the median of a sample of $s$ items will fall between the $i$th
and $(i+1)$th percentiles of the distribution from which the
samples are drawn. Then a lower limit of the standard deviation of
the medians of samples containing $s$ items drawn from this
distribution is given by the expression

$\left[ \sum_{i=1}^{49} x_i^2\, p_{i-1} + \sum_{i=51}^{99} x_i^2\, p_i \right]^{1/2}$,

and an upper limit, by the expression

$\left[ \sum_{i=0}^{49} x_i^2\, p_i + \sum_{i=51}^{100} x_i^2\, p_{i-1} \right]^{1/2}$,

where, in the case of a distribution in which the zeroth or
hundredth percentile is at an infinite distance from the median,
$x_0$ denotes the largest value of $x$ for which it is true that
$\int_{-\infty}^{x} f(x)\,dx < \epsilon$, and $x_{100}$ denotes the smallest value of $x$ for
which it is true that $\int_{x}^{\infty} f(x)\,dx < \epsilon$, where $\epsilon$ is an arbitrarily
small positive constant.

The values of $x_i$ depend on the distribution, and are
independent of the number of items in the sample. The values of
$p_i$ depend on the number of items in the sample and are
independent of the form of the distribution. Approximations to
the values of $p_i$, $(i = 0, 1, 2, \ldots, 99)$, may be obtained by use
of the DeMoivre-Laplace theorem.* In our notation the theorem
may be stated:

The probability that $n$ or more of the items of a sample
containing $(2n-1)$ items will fall to the right of the $i$th
percentile of the distribution from which the sample is drawn is

$P_i = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx$,  where  $z = \frac{n - (2n-1)(1 - 0.01i) - 0.5}{\sqrt{(2n-1)(0.01i)(1 - 0.01i)}}$.

Then $p_i = P_i - P_{i+1}$.
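
The procedure just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's computation: the names `interval_probabilities` and `median_sd_limits` are hypothetical, the sample size `s` is taken to be odd, the parent distribution is supplied through a percentile function, and the cut-off `eps` plays the role of the small constant $\epsilon$.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def interval_probabilities(s):
    """p[i]: approximate probability that the median of s items (s odd) lies
    between the i-th and (i+1)-th percentiles, by the normal approximation."""
    n = (s + 1) // 2                      # the median is the n-th ranked item
    P = [1.0]                             # P[0]: median above the 0th percentile
    for i in range(1, 100):
        q = 1.0 - 0.01 * i                # chance one item exceeds percentile i
        z = (n - 0.5 - s * q) / math.sqrt(s * q * (1.0 - q))
        P.append(1.0 - STD.cdf(z))
    P.append(0.0)                         # P[100]
    return [P[i] - P[i + 1] for i in range(100)]

def median_sd_limits(percentile, s, eps=1e-6):
    """Lower and upper limits of the standard deviation of the median.

    percentile(q) must return the deviation from the median of the point
    below which a fraction q of the parent distribution lies.
    """
    x = [percentile(min(max(0.01 * i, eps), 1.0 - eps)) for i in range(101)]
    p = interval_probabilities(s)
    lower = upper = 0.0
    for i in range(100):
        # Within each percentile interval use the end point nearer the
        # median for the lower limit and the farther one for the upper.
        near, far = sorted((abs(x[i]), abs(x[i + 1])))
        lower += near * near * p[i]
        upper += far * far * p[i]
    return math.sqrt(lower), math.sqrt(upper)

lo, hi = median_sd_limits(STD.inv_cdf, 7)
print(round(lo, 4), round(hi, 4))
```

Here a unit normal parent is used only as an example; any distribution whose percentiles can be tabulated may be substituted for `STD.inv_cdf`.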


Tables** of values of $p_i$ for samples containing 7 and 51
items have been computed, and have been used in calculating
upper and lower limits of the standard deviation of the medians
of samples containing 7 and 51 items drawn from the distributions
whose equations are
2 


J -+ 
f, ) = yr © 
_ Gray Gena)” 
2 = 
heats salerfe “se ~ }, 


2277 


2 


7 (4/3 x) a Gx) 


foe Gale tre?) 


* Rietz, H.L., Mathematical Statistics, Carus Monograph III, p. 35.
** These tables are included in the author's dissertation, which is filed
in the library of the University of Wisconsin.
Approximations to the value of the standard deviation of the
median have also been obtained using formula (2). The results
are tabulated below.


STANDARD DEVIATION OF THE MEDIAN


             7 items.                       51 items.

   Upper    Lower    Formula       Upper    Lower    Formula
   Limit    Limit      (2)         Limit    Limit      (2)

We conclude that, when applied to samples containing a
fairly small number of items, the results obtained using the
customary formula for the standard deviation of the median may
be very untrustworthy, particularly for a distribution in which
the median falls at a point of relatively small frequency.
We therefore shall propose another method for the comparison
of the stability of the arithmetic mean and median, one which
does not involve the computation of the standard deviation of
these averages.


V. 


THE RELATIVE STABILITY OF THE MEDIAN AND ARITHMETIC
MEAN, DETERMINED FROM THE FREQUENCY DISTRIBUTIONS
OF THESE AVERAGES.


1. The Frequency Distributions of the Median and Mean. 
Since the true value of the standard deviation of medians of
samples containing $s$ items, drawn from a frequency distribution
which is composed of two normal distributions, is not easily
determinable, and since the customary approximation is not
sufficiently accurate to justify its use in the study of small samples
drawn from a distribution of this type, we shall develop a method
of comparing the relative stability of the median and arithmetic
mean, based not on the standard deviations of these two averages
but on their frequency distributions.
Another consideration, aside from expediency, motivates the
development of this method, for even if the standard deviations
of the arithmetic mean and median could be accurately computed,
they would not determine the relative stability of the two averages
unless it is assumed that the frequencies of the mean and median
are distributed in the same fashion. If, however, the equations
of the frequency curves of the mean and median of samples of
$s$ items drawn from a given distribution are determined, then
by comparing the deviations from the median of corresponding
percentiles of these two averages, a judgment as to the relative
stability of the two averages may be formed.

We shall assume the frequency curve of the arithmetic means
of samples of $(2n+1)$ items to be normal, independent of the
form of the theoretical distribution from which the samples are
drawn, and to possess a standard deviation of $\sigma/\sqrt{2n+1}$,
where $\sigma$ is the standard deviation of the theoretical
distribution.* We proceed to determine the equation of the frequency
curve of the medians of these samples.


Let the equation of the original distribution be $y = f(x)$,
and let the condition

$\int_{-\infty}^{0} f(x)\,dx = 0.5 = \int_{0}^{\infty} f(x)\,dx$

be satisfied. Then the probability that the median of a sample
of $(2n+1)$ items will fall in the interval $(x, x+dx)$ is the
product of the probabilities that an item will fall in this interval
and that of the remaining $2n$ items, $n$ will fall above this interval
and $n$ below this interval. We let $y_M$ denote the frequency
function according to which the medians of the samples are distributed,
and obtain

$y_M = (2n+1)\binom{2n}{n} f(x) \left[\int_{-\infty}^{x} f(x)\,dx\right]^n \left[\int_{x}^{\infty} f(x)\,dx\right]^n$

$\quad = (2n+1)\binom{2n}{n} f(x) \left[0.5 + \int_{0}^{x} f(x)\,dx\right]^n \left[0.5 - \int_{0}^{x} f(x)\,dx\right]^n$

(1)  $\quad = (2n+1)\binom{2n}{n} f(x) \left[0.25 - \left(\int_{0}^{x} f(x)\,dx\right)^2\right]^n$.**

* Rietz, H.L., Mathematical Statistics, Carus Monograph III, p. 127.
** A similar expression for the probability density of the median is
found in a paper by E. L. Dodd (Functions of Measurements under General
Laws of Error, Skandinavisk Aktuarietidskrift, 1922, p. 150), and is there
used in comparing the relative stability of the median and arithmetic mean
of certain theoretical frequency distributions. However, the method used
in Dodd's paper is to compare the probability densities of the two averages
at the median of the original distribution, rather than to compare the
deviations from the median of specific percentiles of the frequency curves
of the two averages, as we shall do. Dodd uses Stirling's formula to obtain
an approximation to the probability density at the median,

$y_M(0) = \sqrt{\frac{2(2n+1)}{\pi}}\, f(0)$,

and represents the probability density of the arithmetic mean at the same
point by the expression

$y_{\bar{x}}(0) = \frac{\sqrt{2n+1}}{\sigma\sqrt{2\pi}}$,

where $\sigma$ is the standard deviation of the original distribution.

It is readily seen that this method of comparison, when applied to
small samples, would lead to exactly the same inaccuracies that would
result if the relative stability of the two averages were determined by
comparing their standard deviations, the customary approximation formula
being used to obtain the value of the standard deviation of the median, since

$\frac{y_M(0)}{y_{\bar{x}}(0)} = \sqrt{\frac{2(2n+1)}{\pi}}\, f(0) \cdot \frac{\sigma\sqrt{2\pi}}{\sqrt{2n+1}} = \frac{2\sqrt{2n+1}\, f(0)}{\sqrt{2n+1}/\sigma} = \frac{\sigma_{\bar{x}}}{\sigma_M'}$.


2. The Stability of an Average Determined from Its Probable
Error.

Expressions for the frequency functions of the median and
mean of samples containing $(2n+1)$ items having been
determined, we may form a judgment as to the relative stability of
these two averages for a given distribution by comparing the
deviations, from the median of the given distribution, of
corresponding percentiles of the two averages. If some definite
criterion of relative stability is desired, it seems natural to select
the probable errors of the averages, where the term "probable
error" is understood to have its original meaning, and not to
denote a fixed multiple of the standard deviation of the average.
We shall therefore proceed to determine the deviation, from the
median, of a given percentile of the frequency distribution of
medians of samples containing $(2n+1)$ items drawn from the
distribution whose equation is $y = f(x)$.




Let $S$ denote that fraction of the area under the frequency
curve of the medians which is bounded by ordinates drawn to
the curve at the points $x = 0$ and $x = e$. Our problem is to
determine the value of $e$ which corresponds to an assigned value
of $S$, and from (1) the relationship between $e$ and $S$ is seen
to be expressible in the form

(2)  $S = (2n+1)\binom{2n}{n} \int_{0}^{e} f(x) \left[0.25 - \left(\int_{0}^{x} f(x)\,dx\right)^2\right]^n dx$,

where $S$ may be assigned any value in the interval $(0 < S \le 0.5)$.
If the transformation

$t = 2\int_{0}^{x} f(x)\,dx$

be applied to equation (2) it becomes

(3)  $S = \frac{(2n+1)\binom{2n}{n}}{2^{2n+1}} \int_{0}^{w} (1-t^2)^n\,dt$,  where $w$ corresponds to $e$.

Then

(4)  $\frac{2^{2n+1} S}{(2n+1)\binom{2n}{n}} = \int_{0}^{w} (1-t^2)^n\,dt = w - \frac{n}{3}w^3 + \frac{n(n-1)}{2!\,5}w^5 - \frac{n(n-1)(n-2)}{3!\,7}w^7 + \cdots + (-1)^n \frac{w^{2n+1}}{2n+1}$,

and using Stirling's approximation, we obtain

(5)  $\frac{2S\sqrt{\pi n}}{2n+1} = w - \frac{n}{3}w^3 + \frac{n(n-1)}{2!\,5}w^5 - \frac{n(n-1)(n-2)}{3!\,7}w^7 + \cdots + (-1)^n \frac{w^{2n+1}}{2n+1}$.

It is observed from equation (3) that, for a fixed value of
$n$, $w$ is a monotone increasing function of $S$, and that $w = 0$
when $S = 0$, and $w = 1$ when $S = 0.5$. It is also observed that,
for a fixed value of $S$, $w$ is a monotone decreasing function of
$n$, and that $\lim_{n \to \infty} w = 0$.

Unless $S$ is assigned a value near 0.5, we may obtain an
approximation to the value of $w$ from equation (5) by neglecting
terms containing powers of $w$ higher than the third. We wish
to determine the degree of approximation which is introduced
by dropping terms after the second from the second member of
equation (5). To this end, we shall first ascertain an interval
of values of $S$ for which it is true that $w$ is not greater than
the simple, decreasing function of $n$, $1/\sqrt{n}$; that is, we shall
determine the interval of values of $S$ which satisfy the inequality

$\frac{2S\sqrt{\pi n}}{2n+1} \le \frac{1}{\sqrt{n}} \left[ 1 - \frac{1}{3} + \frac{(n-1)}{2!\,5n} - \frac{(n-1)(n-2)}{3!\,7n^2} + \cdots + \frac{(-1)^n}{(2n+1)n^n} \right]$,

or

$S \le \frac{2n+1}{2n\sqrt{\pi}} \left[ 1 - \frac{1}{3} + \frac{(n-1)}{2!\,5n} - \frac{(n-1)(n-2)}{3!\,7n^2} + \cdots + \frac{(-1)^n}{(2n+1)n^n} \right]$.

The second member of this inequality is greater than

$\frac{1}{\sqrt{\pi}} \left[ 1 - \frac{1}{3} + \frac{(n-1)}{2!\,5n} - \frac{(n-1)(n-2)}{3!\,7n^2} + \cdots + \frac{(-1)^n}{(2n+1)n^n} \right]$,

and since the terms of the finite alternating series within
parentheses obviously decrease in numerical value, their sum will exceed
2/3 for all positive values of $n$. Therefore the inequality

$w \le \frac{1}{\sqrt{n}}$

will certainly be satisfied for all values of $S$ in the interval

$|S| \le \frac{2}{3\sqrt{\pi}} = 0.3761$,

and therefore $w$ is not greater than $1/\sqrt{n}$ when $S$ is assigned
a value corresponding to a percentile of the frequency distribution
of the medians between the 13th and 87th percentiles. Certainly
in determining the first and third quartiles of the frequency
distribution of the medians, $w$ will be less than $1/\sqrt{n}$.

Since in equation (5) the value of $S$ is given by a finite
alternating series whose terms do not increase in numerical value,
the error involved in neglecting terms after the second will be
less than the first term neglected:

$\frac{2n+1}{2\sqrt{\pi n}} \cdot \frac{n(n-1)}{2!\,5}\, w^5$.

But, when $|S| < 0.3761$, it is true that

$\frac{2n+1}{2\sqrt{\pi n}} \cdot \frac{n(n-1)}{2!\,5}\, w^5 \le \frac{2n+1}{2\sqrt{\pi n}} \cdot \frac{n(n-1)}{2!\,5} \cdot \frac{1}{n^2\sqrt{n}} = \frac{2n^2 - n - 1}{20\,n^2\sqrt{\pi}} = \frac{1}{10\sqrt{\pi}} \left( 1 - \frac{1}{2n} - \frac{1}{2n^2} \right) < \frac{1}{10\sqrt{\pi}}$.

We conclude, therefore, that the value of $w$ obtained from
equation (5) by neglecting powers of $w$ higher than the third
corresponds to a percentile of the frequency distribution of the
median which differs from the assigned value of $S$ by not more
than 0.05.

We see, then, that an approximation to the value, $e$, of a
given percentile of the frequency distribution of the medians of
samples containing $(2n+1)$ items drawn from the theoretical
distribution whose equation is $y = f(x)$ may be obtained by
solving for $w$ the third degree equation

(6)  $\frac{2S\sqrt{\pi n}}{2n+1} = w - \frac{n}{3}w^3$,  where  $w = 2\int_{0}^{e} f(x)\,dx$.



The tables of values of $p_i$ mentioned in the preceding
section afford a check on the accuracy of the results of equation
(6) when $(2n+1)$ is assigned the values 7 and 51. From these
tables it is observed that the third quartile of the frequency
distribution of the medians of samples containing 7 items falls at
the 62nd percentile of the theoretical distribution from which
the samples are drawn, and that for samples containing 51 items
the third quartile of the medians falls near the 55th percentile
of the original distribution. Hence, when $S = 0.25$, the value of
$\int_{0}^{e} f(x)\,dx$, accurate to two places of decimals, is 0.12 when
$(2n+1)$ equals 7, and 0.05 when $(2n+1)$ equals 51.

In equation (6) let

$k = \frac{2S\sqrt{\pi n}}{2n+1}$,

whence we obtain

$\frac{n}{3}w^3 - w + k = 0$.

Letting $w = k + \lambda$, this equation becomes

$\frac{n}{3}(k^3 + 3k^2\lambda + 3k\lambda^2 + \lambda^3) - \lambda = 0$,

or as an approximation

$\lambda = \frac{\frac{n}{3}k^3}{1 - nk^2}$.

Assigning to $(2n+1)$ the values 7 and 51 we obtain the following
results:

   2n+1      k        λ        w      ∫f(x)dx    Computed value of ∫f(x)dx

     7     .2193    .0123    .2316     .1158              .12
    51     .0869    .0072    .0941     .0471              .05

Thus we have developed a method of determining the prob- 
able error (or any percentile) of the median, which possesses the 
double advantage of being applicable to distributions which do 
not contain a very large number of items, and of being applied 
easily to any distribution, for after (6) has been used to determine
the value of $\gamma$, the corresponding value of $x$ may be obtained
either from a table of integrals of a theoretical frequency func-
tion, or from an actual distribution by cumulating frequencies
beyond the median.


The calculation of the probable error (or any percentile) 
of the arithmetic mean offers no difficulty if we assume the means 
of samples to be normally distributed. The relative stability of 
the two averages may then be determined by comparing the 
probable errors (or corresponding percentiles) of the two aver- 
ages. This method of comparison will be applied to a particular 
distribution in section VI of this paper. 
A table* of values of $\gamma$ and $k$ for certain assigned values
of $n$ and $S$ is given below.


Values of $\gamma$ and $k$ when
$k = \frac{2S\sqrt{\pi n}}{2n+1} = \int_{0}^{\gamma} (1-t^{2})^{n}\,dt$


.027 
054 
082 
111 


 n        k          γ

25     .017377     .017
       .034754     .035
       .052131     .053
       .069508     .073
       .104262     .116
       .121639     .143
       .139016     .176
       .156393     .223


50 


100 





* Computed by Miss Beatrice Berberich, university computer,
University of Wisconsin.
VI.
AN APPLICATION TO A PARTICULAR SEQUENCE OF ECONOMIC
DATA OF VARIOUS METHODS OF COMPARING THE STABILITY
OF THE ARITHMETIC MEAN AND MEDIAN.


1. Dissection into a Symmetrical Distribution Composed of Two 
Normal Distributions. 


In a paper by W. L. Crum" a particular sequence of economic 
data has been examined with the purpose of determining the 
relative stability of its median and arithmetic mean. The series 
studied comprises the monthly link relatives of the rate of interest 
on 60-90 day commercial paper from January, 1890, to January, 
1917. A frequency distribution of deviations from their medians 
of the link relatives for each month is reproduced below, together 
with the values of the first six moments of the distribution. 


FREQUENCIES OF DEVIATIONS FROM THE MEDIANS 


1 
1 
1 
1 
1 
1 
1 
1 
2 
1 
2 
Professor Crum’s method of attack is to dissect the series,
according to Pearson’s method, into two normal components 
whose means are coincident. He therefore fits to the data a curve 





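whose equation is

(1) $$y = \frac{1}{\sqrt{2\pi}}\left[\frac{c_1}{\sigma_1}\,e^{-\frac{x^{2}}{2\sigma_1^{2}}} + \frac{c_2}{\sigma_2}\,e^{-\frac{x^{2}}{2\sigma_2^{2}}}\right],$$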


and obtains for the parameters the values: 


4, =.26, J, = 2.46, 4,= .74, G-/49 
This theoretical distribution is of the type discussed in paragraph
3, section II. Its median and mean will be equally stable if
$p = \sigma_2/\sigma_1$ satisfies the equation

$$c_1^{2}c_2\,p^{4} + 2c_1c_2^{2}\,p^{3} + \left(c_1^{3} + c_2^{3} - \frac{\pi}{2}\right)p^{2} + 2c_1^{2}c_2\,p + c_1c_2^{2} = 0.$$

Letting $c_1 = 0.25$ and $c_2 = 0.75$, this equation reduces to

$$p^{4} + 6p^{3} - 24p^{2} + 2p + 3 = 0,$$
which has a root between 2.5 and 2.6. Since for the distribution 
under consideration / = 4% , the standard deviation of the arith- 


metic mean is larger than the standard deviation of the median, 
and the median is the more stable average. 
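
The comparison can be illustrated with a short Python sketch. It
assumes a unit-area mixture of two normal curves with coincident
means, so that N times the variance of the mean is
$c_1\sigma_1^{2} + c_2\sigma_2^{2}$, and it assumes the large-sample
approximation $\sigma_M = 1/(2y_M\sqrt{N})$ for the median, with
$y_M = (c_1/\sigma_1 + c_2/\sigma_2)/\sqrt{2\pi}$. The function name and
the trial values of $p$ are hypothetical.

```python
import math

def variance_ratio(p, c1, c2):
    """Variance of the mean divided by variance of the median.

    p is sigma_2 / sigma_1; the common factors sigma_1^2 and 1/N cancel.
    A value above 1 means the median is the more stable average.
    """
    mean_part = c1 + c2 * p * p                        # N * var(mean) / sigma_1^2
    median_part = (math.pi / 2) / (c1 + c2 / p) ** 2   # N * var(median) / sigma_1^2
    return mean_part / median_part

# p = 1 is a single normal curve, for which the mean is more stable.
print(variance_ratio(1.0, 0.25, 0.75))
# A ratio of standard deviations well above the critical value.
print(variance_ratio(4.8, 0.25, 0.75))
```

For a single normal curve the ratio is $2/\pi$, the familiar result
that the mean is the more stable average; for sufficiently unequal
components the ratio exceeds 1 and the median becomes the more
stable.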


2. Dissection into an Asymmetrical Distribution Composed of 

Two Normal Distributions. 

In the method of dissection employed by Professor Crum, 
the slight positive skewness which the distribution possesses is 
ignored. We shall dissect the data into two normal components 
whose means are not equal, and investigate the relative stability 
of the median and mean of the resulting asymmetrical distribution : 








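(2) $$y = \frac{1}{\sqrt{2\pi}}\left[\frac{c_1}{\sigma_1}\,e^{-\frac{(x-b_1)^{2}}{2\sigma_1^{2}}} + \frac{c_2}{\sigma_2}\,e^{-\frac{(x-b_2)^{2}}{2\sigma_2^{2}}}\right].$$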


Pearson's method of dissecting an asymmetrical distribution 
depends on the solution of his “fundamental nonic,” 


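$$24p_2^{9} - 28\lambda_4 p_2^{7} + 36\mu_3^{2}p_2^{6} - (24\mu_3\lambda_5 - 10\lambda_4^{2})p_2^{5} - (148\mu_3^{2}\lambda_4 + 2\lambda_5^{2})p_2^{4}$$

$$+ (288\mu_3^{4} - 12\lambda_4\lambda_5\mu_3 - \lambda_4^{3})p_2^{3} + (24\mu_3^{3}\lambda_5 - 7\mu_3^{2}\lambda_4^{2})p_2^{2} + 32\mu_3^{4}\lambda_4 p_2 - 24\mu_3^{6} = 0,$$

in which $\mu_t$ denotes the $t$th moment of the given distribution,
and $\lambda_4 = 9\mu_2^{2} - 3\mu_4$, $\lambda_5 = 30\mu_2\mu_3 - 3\mu_5$.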

A value of $p_2$ having been obtained from this equation, the
parameters of equation (2) are determined by solving, successively,
the equations


3 2 3 
en, - 244, A, 2, - ASM, — B44, 7, 


2 


4A; dy het 2p 
3/4, ? 


(the two roots of this equation are 
denoted & , €, ) 


? 


The calculation of the Sturm’s functions of the fundamental 
nonic shows it to have three real roots, two between 0 and —100, 
and a third between 200 and 300. The values of these roots are 
found to be 

fn -F 5517, ~1h 6140, 210. 
However, the use of the second and third of these roots leads to 
imaginary values of certain of the parameters of equation (2), 
and they are therefore rejected. Using the root $p_2 = -5.5517$,
the parameters of equation (2) are found to have the following
values:

$c_1 = 0.9637$, $b_1 = -0.46$, $\sigma_1 = 9.02$,

$c_2 = 0.0363$, $b_2 = 12.20$, $\sigma_2 = 25.67$.
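3. Dissection into a Symmetrical Distribution Composed of Three
Normal Distributions.
Finally, let the given data be fitted by a frequency curve
whose equation is

(3) $$y = \frac{1}{\sqrt{2\pi}}\left[\frac{c_1}{\sigma_1}\,e^{-\frac{x^{2}}{2\sigma_1^{2}}} + \frac{c_2}{\sigma_2}\left(e^{-\frac{(x-b)^{2}}{2\sigma_2^{2}}} + e^{-\frac{(x+b)^{2}}{2\sigma_2^{2}}}\right)\right],$$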


where the origin is selected at the arithmetic mean of the original 
series. The dissection depends on the solution of equation (7), 
section III, which for the distribution under consideration has 
the form 








12 10 8 o rv > 
UL - 58.355 iL - 230,538 UL - 243,922 U -GOISY LL ~3,/04U - 0,527 = O. 


The only positive root of this equation is w= 62./3, 
Solving, successively, equations (6) and (5) of section III, the
values of the parameters of equation (3) are found to be

$c_1 = 0.96$, $\sigma_1 = 7.64$, $c_2 = 0.02$, $\sigma_2 = 4.67$, $b = 36.95$.


The accompanying figure shows the original distribution 
(grouped into class intervals of five units) and the curves obtained 
by each of the three methods of dissection, plotted on the same 
set of axes. It appears from the figure that a distribution of 
type (3) fits the data more closely than either of the other curves. 
This fact may be checked by comparing the sums of the squares 
of the differences between the actual and theoretical frequency 
of each class. The values of these sums of squares of deviations 
from theoretical distributions (1), (2), (3) are found to be, 
respectively, 68.250, 51.604, 19.435. 


4. Relative Stability of Median and Mean for Each of the Methods
of Dissection.


[Figure: frequency of grouped observations, with the fitted
curves (1), (2), (3).]

We turn now to the problem of comparing the stability of
the median and mean of the three theoretical frequency functions
obtained by dissection. Since $N = 324$ is fairly large, and since
the median of each of the theoretical distributions is located at
a point of relatively large frequency, we shall use the approx-
imation to the standard deviation of the median,

$$\sigma_M = \frac{1}{2\,y_M\sqrt{N}},$$

where $y_M$ is the ordinate at the median.
For Crum’s dissection into two normal distributions, the 


arithmetic mean and median are both at the origin. Hence 


/ 2 
. =-=—— V4 95 +£G" = 0.57, 
324 
Dy = i = 0.%Z, 
2 V3! (+ = | 
Vir ‘6° G 


which verifies his conclusion that the median is more stable than 
the arithmetic mean. 
For the asymmetrical dissection into two normal curves, 


X= 46+ 046 
oe / 


o_ 
x V324 


/] is a root of the equation 





and its value, obtained by interpolation in a table of values of 
the integral fe St , is 4-0/6. Hence 
! = 0.64, 


mm =a _ 6-1 
ave2t (Se 2a, Seo x ) 
vir \6, . 





For this method of dissection the arithmetic mean is more stable
than the median. This was to be expected, since the second com-
ponent contains so small a fraction of the total area that the
compound curve differs little from a single normal curve. The
figure shows that this curve would not naturally be chosen to
represent the given data.
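For the symmetrical three normal curve dissection,

$$M = \bar{x} = 0,$$

$$\sigma_{\bar{x}} = \frac{1}{\sqrt{324}}\sqrt{c_1\sigma_1^{2} + 2c_2\sigma_2^{2} + 2c_2b^{2}} = 0.59,$$

$$\sigma_M = \frac{\sqrt{2\pi}}{2\sqrt{324}\left(\dfrac{c_1}{\sigma_1} + \dfrac{2c_2}{\sigma_2}\,e^{-\frac{b^{2}}{2\sigma_2^{2}}}\right)} = 0.55.$$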


This method of dissection bears out Crum’s conclusion that the 
median of the original series is a more stable average than the 
arithmetic mean, although the difference between the standard 
deviations of the two averages obtained by this method is con- 
siderably smaller than that obtained by the first method of dis- 
section. 
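
The two standard deviations for the three-curve dissection can be
recomputed with a short Python sketch. It assumes the parameter
values $c_1 = 0.96$, $\sigma_1 = 7.64$, $c_2 = 0.02$, $\sigma_2 = 4.67$,
$b = 36.95$ and $N = 324$, a unit-area curve of type (3) with median at
the origin, and the approximation $\sigma_M = 1/(2y_M\sqrt{N})$; the
variable names are illustrative.

```python
import math

N = 324
# Assumed readings of the fitted parameters of the three-curve dissection.
c1, s1, c2, s2, b = 0.96, 7.64, 0.02, 4.67, 36.95

# Variance of the mixture about the origin: the central curve plus
# the two outer curves displaced by +b and -b.
variance = c1 * s1**2 + 2 * c2 * (s2**2 + b**2)
sd_mean = math.sqrt(variance / N)

# Ordinate of the unit-area curve at the median x = 0.
y_M = (c1 / s1 + 2 * c2 / s2 * math.exp(-b**2 / (2 * s2**2))) / math.sqrt(2 * math.pi)
sd_median = 1 / (2 * y_M * math.sqrt(N))

print(round(sd_mean, 2), round(sd_median, 2))  # 0.59 0.55
```

The outer components lie so far from the origin that they add almost
nothing to the ordinate at the median, while contributing about half
of the variance; this is why the median comes out slightly more
stable than the mean.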


5. The Probable Errors of the Median and Mean, Determined 
from the Frequency Distributions of these Averages. 

The above discussion has been concerned with certain the- 
oretical frequency curves, rather than with the actual data which 
these curves are intended to fit. We shall now compare the 
relative stability of the mean and median by the method of section 
V, which does not involve a fitting to the data of a theoretical 
frequency curve. 

The method of determining the quartiles of the frequency 
distribution of the median, developed in section V, assumed the 
number of items in the sample to be odd. We therefore solve 
the equations 


$$k = \frac{2S\sqrt{\pi n}}{2n+1}, \qquad \lambda = \frac{n k^{3}/3}{1 - n k^{2}}, \qquad S = 0.25,$$

using the values 323 and 325 for $(2n+1)$ and obtain roots

$$k = 0.0348, \qquad \lambda = 0.0027$$

in each case, whence $\gamma = k + \lambda = 0.0375$,
and $\int_{0}^{x} f(x)\,dx = 0.0188$.
o 


For the given distribution, the median falls at zero deviation.
The fourteen items in the upper half of the zero class comprise
4.32% of the entire frequency distribution. Hence the third
quartile of the distribution of the medians has a value

$$\frac{0.0188}{0.0432}\,(0.5) = 0.2176.$$

Similar reasoning shows the value of the first quartile of the
medians to be $-0.2176$, whence the semi-interquartile range is
0.2176.
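
The interpolation within the zero class can be written out as a
small Python sketch. It assumes that the required area beyond the
median is 0.0188, that the upper half of the zero class holds the
fraction 0.0432 of all items, and that this half class is 0.5 units
wide with the items spread uniformly over it; the variable names are
illustrative.

```python
# Area between the median and the third quartile of the medians
# (half of gamma), as a fraction of the whole distribution.
area_needed = 0.0188

# Fraction of all items lying in the upper half of the zero class,
# and the assumed width of that half class.
half_class_fraction = 0.0432
half_class_width = 0.5

# Linear interpolation: the share of the half class that must be
# covered, multiplied by its width.
third_quartile = area_needed / half_class_fraction * half_class_width
print(round(third_quartile, 4))  # 0.2176
```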

Since oO: 106.85, the probable error of the arithmetic 


mean has a value 


6745 vo = 0.3845, 


Thus the median is again shown to be more stable than the 
arithmetic mean. 
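
The closing comparison can be restated as a brief Python sketch. It
assumes a standard error of the mean of 0.57, the normal-curve
factor 0.6745 for the probable error, and the semi-interquartile
range 0.2176 found for the medians; the variable names are
illustrative.

```python
# Probable error of the mean under a normal sampling distribution:
# 0.6745 times the standard error (assumed here to be 0.57).
standard_error_mean = 0.57
probable_error_mean = 0.6745 * standard_error_mean

# Semi-interquartile range of the distribution of the medians.
probable_error_median = 0.2176

print(round(probable_error_mean, 4))                # 0.3845
print(probable_error_median < probable_error_mean)  # True
```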
Harry S. Pollard,
Miami University,
Oxford, Ohio.











